


The Journal of
Educational Psychology


Devoted Primarily to the Scientific Study of Problems of 
Learning and Teaching. 























BOARD OF EDITORS:

HAROLD ORDWAY RUGG, Chairman, Lincoln School of Teachers College, Teachers College, Columbia University.
RUDOLF PINTNER, Teachers College, Columbia University.
BEARDSLEY RUML, Carnegie Corporation, New York City.
LEWIS MADISON TERMAN, Leland Stanford University.
EDWARD LEE THORNDIKE, Teachers College, Columbia University.
JAMES CARLETON BELL, Brooklyn Training School for Teachers.
FRANK NUGENT FREEMAN, University of Chicago.
ARTHUR IRVING GATES, Teachers College, Columbia University.
VIVIAN ALLEN CHARLES HENMON, University of Wisconsin.
LAURA ZIRBES, Assistant Editor, Lincoln School of Teachers College.


$4.00 a year, 
60c. per copy. 


FEBRUARY, 1923 








CONTENTS 





A SECOND APPROXIMATION TO THE CURVE OF THE DISTRIBUTION OF INTELLIGENCE OF THE POPULATION OF THE UNITED STATES, WITH A NOTE ON THE STANDARDIZATION OF THE STANFORD REVISION OF THE BINET-SIMON SCALE. Percival M. Symonds

THE GRAPHIC RATING SCALE. Max Freyd

THE UNRELIABILITY OF THE DIFFERENCE BETWEEN INTELLIGENCE AND EDUCATIONAL RATINGS. J. Crosby Chapman

IS IT NECESSARY TO WEIGHT EXERCISES IN STANDARD TESTS? Harl R. Douglas and Peter L. Spencer

RETENTION AFTER LONG PERIODS. Dean A. Worcester

DRILL IN ARITHMETIC. F. B. Knight

A COMPARISON OF THE ESSAY AND THE OBJECTIVE TYPE OF EXAMINATIONS. Donald A. Laird

NOTES ON ARTICLES IN EDUCATIONAL PSYCHOLOGY IN CURRENT ISSUES OF OTHER MAGAZINES

NEW PUBLICATIONS IN EDUCATIONAL PSYCHOLOGY AND RELATED [remainder of title illegible]





Published Monthly Except June to August by 
WARWICK and YORK, Inc.,
York, Pa. Baltimore, Md. 











——— 





Entered as Second Class matter November 15, 1921, at the Post Office at York, Pennsylvania, under the Act of March 3, 1879.










































The Journal of Educational Psychology is published monthly by Warwick & York, Inc., Baltimore, Md., and York, Pa. The issues for 1922 make Volume XIII. Title page and index are bound in December number.

Manuscripts for publication, books or other materials for review, and news items should be addressed to Harold Ordway Rugg, Chairman of the Editorial Board, 59 Edgecliffe Terrace, Yonkers, New York.

The price of the Journal is $4.00 a year, payable in advance. Foreign postage is 30 cents extra. Single issues are 60 cents each, and less than a full year's subscription is charged at single issue rate for every month ordered.

Subscribers should notify the publishers of change in address at least four weeks before publication of the issue with which change is to take effect. No claim for non-receipt of an issue will be entertained unless made within two weeks after receipt of the next succeeding issue.

Warwick & York, Inc.,
Baltimore, Md.

HOTEL RENNERT
BALTIMORE, MARYLAND
Headquarters. European Plan. Entirely Fire-Proof. Centrally Located.
EDWARD DAVIS











RECENT PUBLICATIONS 





Introductory Psychology for Teachers, Revised.
By Edward K. Strong, Jr.
520 pages. $2.20. Postage 12¢.

Brief Introductory Psychology for Teachers.
By Edward K. Strong, Jr.
241 pages. $1.60. Postage 10¢.

The Output of Professional Schools for Teachers.
By Charles E. Benson.
88 pages. $1.50. Postage 10¢.

Textbook Selection.
By R. H. Franzen and F. B. Knight.
Introduction by Ernest Horn.
94 pages. $1.20. Postage 6¢.





Warwick & York, Inc., Publishers, Baltimore. 














THE JOURNAL OF
EDUCATIONAL PSYCHOLOGY








Volume XIV February, 1923 Number 2 








A SECOND APPROXIMATION TO THE CURVE OF THE
DISTRIBUTION OF INTELLIGENCE OF THE
POPULATION OF THE UNITED STATES,
WITH A NOTE ON THE STANDARDIZATION
OF THE STANFORD REVISION
OF THE BINET-SIMON SCALE

PERCIVAL M. SYMONDS
University of Hawaii

The first approximation to the curve of distribution of intelligence
of the population of the United States is the IQ curve, a normal curve
of error obtained in the standardization of the Stanford Revision of
the Binet-Simon Scale.

The method employed in this paper to obtain a second approximation
is roughly as follows:

Determine the curves of distribution of the 9 occupation groups as 
given in Table XIV on page 53 of Volume IV of the 1910 (thirteenth) 
census of the United States, making the area of these proportional 
to the percentage which they are of the total population (Table VIII, 
page 40, 1910 census), and then determine the cumulative curve of 
distribution by adding the ordinates of the separate curves. 

The steps in the process more precisely are: 

1. Determine averages of medians, Q1 and Q3 (scores on Army
Alpha) of the occupations in each occupation group, weighting the
separate occupations roughly in proportion to the number in the
group. The data for the medians, Q1 and Q3, of the various occupations
I take from the table on page 275 in the article by Fryer entitled
Occupation Intelligence Standards in School and Society, Vol. XVI,
Sept. 2, 1922, page 401.




AGRICULTURE, FORESTRY AND ANIMAL HUSBANDRY

                      Q1   Median   Q3   Weights   Population
Farmers               40     58     83     59       5,865,000
Farm laborers         13     21     47     60       5,975,000
Fishermen             15     20     51      1          68,000
Lumbermen             18     35     62      2         161,000
                                                   12,069,000 out of
                                                   12,659,000
Weighted average      26     39     65
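
The weighted averages in the agriculture table can be checked directly from its four rows. The following is a minimal sketch, assuming the "Weights" column is used as a plain multiplier on each occupation's Q1, median and Q3; the names `ROWS` and `weighted_average` are illustrative and do not come from the paper.

```python
# Rows of the agriculture table: (Q1, median, Q3, weight).
# The weights are the author's rough weights (about one unit per 100,000 workers).
ROWS = [
    (40, 58, 83, 59),  # first row of the table
    (13, 21, 47, 60),  # farm laborers
    (15, 20, 51, 1),   # fishermen
    (18, 35, 62, 2),   # lumbermen
]


def weighted_average(rows):
    """Return the weighted means of Q1, median and Q3 (hypothetical helper)."""
    total_weight = sum(weight for *_, weight in rows)
    return tuple(
        sum(row[i] * row[3] for row in rows) / total_weight
        for i in range(3)
    )


q1, median, q3 = weighted_average(ROWS)
print(round(q1), round(median), round(q3))  # 26 39 65
```

Rounded to whole scores, the result reproduces the "Weighted average" row of the table.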

















The standards for farm laborers are taken to be the same as for
construction laborers given in Fryer's table. I have no data to allow
me to take independent figures for farm laborers. All through this
paper laborers are given the standards 13, 21, 47.
EXTRACTION OF MINERALS

                      Q1   Median   Q3   Weights   Population
[illegible]          [?]     49     71     92         916,000
[illegible]           59     77    107      2          23,000
Executives            81    109    137      3          25,000
                                                      964,000 out of
                                                      965,000
Weighted average      42     51     74

([?] marks an entry that is illegible in the source.)























Intelligence of Population in United States 67 


MANUFACTURING AND MECHANICAL INDUSTRIES

                      Q1   Median   Q3   Weights   Population
[illegible]          [?]     59     87      1          90,000
Blacksmiths           39     61     82      2         241,000
[illegible]           19     40     60      2         169,000
Carpenters            44     65     88      8         817,000
[illegible]          [?]     60     93      1         128,000
Electricians          57     81    109      1         136,000
Stationary engineers  35     55     81      2         231,000
[illegible]          [?]     27     63      1         111,000
Foremen               61     79    111      2         175,000
Laborers              13     21     47     25       2,490,000
Machinists            40     63     89      5         479,000
Minor executives      81    109    137      1         104,000
[illegible]          [?]     59     81      3         334,000
Plumbers              44     66     88      1         148,000
Leather workers       16     30     41      2         215,000
Textile workers       18     26     60      7         651,000
Shoemakers            38     56     76    [?]          70,000
[illegible]          [?]     65     89    [?]         205,000
                                                    6,794,000 out of
                                                   10,659,000
Weighted average      28     42     68

([?] marks an entry that is illegible in the source.)


























Important occupations omitted from the tabulations in this
group are:

[The names in this list are illegible in the source. The legible population figures are 128,000; 257,000; and 1,576,000, with a total of 2,996,000.]


At first sight it looks as though the average neglected the
contractors and manufacturers, who are well above the group average
in intelligence. But their combined weighting would be but 4 as against
16 for the remaining semi-skilled operatives, who are probably below
the group average in intelligence, if we may judge by the two
semi-skilled occupations that have been included, leather workers and
textile workers. On the whole, the weighted averages, if anything, are too
high rather than too low. Probably the 1920 census will give more
weight to automobile mechanics.











TRANSPORTATION

                        Q1   Median   Q3   Weights   Population
[illegible]             30     50     72      4         408,000
[illegible]             35     55     77      1          63,000
Brakemen                41     63     86      1          93,000
Conductors              64     83    106      1         122,000
Foremen                 61     79    110      1          70,000
Laborers                13     21     47      8         792,000
Engineers, locomotive   44     61     84      1          96,000
Firemen, locomotive     44     61     84      1          76,000
Telegraph operators     57     85    110      1          70,000
Telephone operators     46     70     95      1          98,000
                                                      1,888,000 out of
                                                      2,638,000
Weighted average        31     46     66




















The outstanding injustice here is the omission of chauffeurs, 
which would wield a much more important influence on the basis of 
1920 census returns. 


TRADE

                      Q1   Median   Q3   Weights   Population
Sales clerks          38     52     96     15       1,472,000
Laborers              13     21     47      2         183,000
Retail dealers        54     85    110     12       1,195,000
                                                    2,850,000 out of
                                                    3,614,000
Weighted average      42     63     98











This is almost surely too low, omitting as it does bankers, insurance
agents, and real estate agents, but I have no means of estimating the
level of these occupations. Their combined weighting would be 3.
The figures for retail dealers were taken from Yerkes ('21), Vol. 15
of the National Academy of Sciences, Psychological Examination in
the United States Army, page 825 under the heading of stock-keepers.


PUBLIC SERVICE

                      Q1   Median   Q3   Weights   Population
Laborers              13     21     47      1          63,000
[illegible]          [?]    109    137      1         105,000
Policemen             46     69     90      1          62,000
Soldiers and sailors  25     44     73      1          77,000
                                                      307,000 out of
                                                      459,000
Weighted average      41     61     87

([?] marks an entry that is illegible in the source.)




















The standards for soldiers and sailors are taken as a mean 
between sailors 16, 32, 59 and the drafted United States forces in the 
World War 35, 56, 87. 





PROFESSIONAL SERVICE

                      Q1   Median   Q3   Weights   Population
Actors                31     62     94      3          28,000
[illegible]          [?]    [?]    [?]    [?]          34,000
[illegible]           94    119    139      2          16,000
Civil engineers      110    161    183      5          52,000
[illegible]          124    152    185     12         118,000
[illegible]           80    110    128      4          40,000
[illegible]           84    114    139      3          33,000
[illegible]           57     82    108     14         139,000
Photographers         59     86    107      3          32,000
[illegible]           97    122    148     60         599,000
Physicians           107    127    164     15         151,000
[illegible]          [?]     99    126      8          82,000
                                                    1,324,000 out of
                                                    1,664,000
Weighted average      92    118    145

([?] marks an entry that is illegible in the source.)


DOMESTIC AND PERSONAL SERVICE

                      Q1   Median   Q3   Weights   Population
[illegible]           34     55     78      2         195,000
Laborers              13     21     47      1          53,000
[illegible]           45     66     91      6         645,000
Nurses                78     99    126      1         127,000
[illegible]           17     27     57      5         450,000
[illegible]           41     57     81      4         380,000
                                                    1,850,000 out of
                                                    3,722,000
Weighted average      36     52     78




















There are many unsatisfactory things in this group. The standards
for caterer have been given to hotel keepers and managers,
housekeepers and stewards, lodging and boarding housekeepers, and
restaurant keepers, although the term caterer is used generally in a
much more restricted sense.

Important occupations omitted are:

[The list is illegible in the source; the legible total is 1,642,000.]

Probably the group, bartenders and saloonkeepers, will be greatly
diminished in the 1920 census.


CLERICAL OCCUPATIONS

                      Q1   Median   Q3   Weights   Population
Bookkeepers           77    101    127      5         487,000
Shipping clerks       54     78    102      1          81,000
[illegible]           74     96    121      6         640,000
Stenographers         73    103    124      3         317,000
                                                    1,525,000 out of
                                                    1,737,000
Weighted average      74     99    122























COLLECTING THE AVERAGES

                                                Q1   Median   Q3
Agriculture, forestry, and animal husbandry     26     39     65
Extraction of minerals                          42     51     74
Manufacturing and mechanical industries         28     42     68
Transportation                                  31     46     66
Trade                                           42     63     98
Public service                                  41     61     87
Professional service                            92    118    145
Domestic and personal service                   36     52     78
Clerical occupations                            74     99    122











2. The second step is to find the σ of those groups:

\[ \sigma = 0.7412\,(Q_3 - Q_1) \]

on the assumption that the distribution is normal, an
assumption which I make at this place.
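
The constant in step 2 follows from the quartiles of a normal curve, which lie about 0.6745σ on either side of the mean. The sketch below derives the factor with Python's standard library and then applies the paper's 0.7412 to the rounded group quartiles collected above. The dictionary and function names are illustrative only. Because the inputs are rounded quartiles, a result can differ from the printed σ by a unit.

```python
from statistics import NormalDist

# For a normal curve, Q3 - Q1 = 2 * z(0.75) * sigma, so sigma = (Q3 - Q1) / (2 * z(0.75)).
DERIVED_FACTOR = 1 / (2 * NormalDist().inv_cdf(0.75))  # close to the paper's 0.7412


def sigma_from_quartiles(q1, q3, factor=0.7412):
    """Estimate the SD of a normal distribution from its quartiles."""
    return factor * (q3 - q1)


# (Q1, Q3) of each occupation group, taken from the collected averages.
GROUP_QUARTILES = {
    "agriculture": (26, 65),
    "minerals": (42, 74),
    "manufacturing": (28, 68),
    "transportation": (31, 66),
    "trade": (42, 98),
    "public service": (41, 87),
    "professional": (92, 145),
    "domestic": (36, 78),
    "clerical": (74, 122),
}

GROUP_SIGMAS = {
    name: sigma_from_quartiles(q1, q3) for name, (q1, q3) in GROUP_QUARTILES.items()
}

for name, sigma in GROUP_SIGMAS.items():
    print(f"{name:15s} {sigma:5.1f}")
```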














                       Per cent   Median    σ      Σ
Agriculture              33.2       39      29    34.2
Extraction of minerals    [?]       51      24    26.4
Manufacturing            27.9       42      30    37.3
Transportation            6.9       46      26    34.6
Trade                     9.5       63      41    45.4
Public service            1.2       61      34    47.1
Professional service      4.4      118      39    44.7
Domestic and personal     9.9       52      31    37.0
Clerical occupations      4.6       99      36    37.5

([?] marks an entry that is illegible in the source.)

















3. Distributions with these σ are not representative of their
occupation groups because they are the average σ of several distributions
with different means. To obtain a truer SD (call it Σ) use the formula:

\[ \Sigma^2_{\text{distribution of occupation group}} = \sigma^2_{\text{average of occupations in group}} + \sigma^2_{\text{distribution of medians of various occupations in group}} \]
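
Step 3 adds the spread of the occupation medians to the average spread within occupations. The sketch below applies this to the agriculture group, under two assumptions that the paper does not state: that the medians are weighted by the same weights used in step 1, and that the within-occupation σ is the value from step 2. The helper name `weighted_variance` is hypothetical.

```python
import math

# (median, weight) for the four occupations in the agriculture table.
MEDIANS_AND_WEIGHTS = [(58, 59), (21, 60), (20, 1), (35, 2)]
GROUP_Q1, GROUP_Q3 = 26, 65  # weighted-average quartiles of the group


def weighted_variance(pairs):
    """Variance of the values in `pairs`, each counted with its weight."""
    total = sum(weight for _, weight in pairs)
    mean = sum(value * weight for value, weight in pairs) / total
    return sum(weight * (value - mean) ** 2 for value, weight in pairs) / total


sigma_within = 0.7412 * (GROUP_Q3 - GROUP_Q1)        # average sigma, step 2
variance_of_medians = weighted_variance(MEDIANS_AND_WEIGHTS)
big_sigma = math.sqrt(sigma_within ** 2 + variance_of_medians)

print(round(big_sigma, 1))  # 34.2
```

Under these assumptions the result agrees with the Σ of 34.2 tabulated for agriculture.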


4. The fourth step is to find the ordinates of those distributions.
First find the Σ distances of the points -40, -30, -20, -10, 0, 10, 20, 30,
40, etc., from the medians of these distributions. Read from Table
II of Pearson's tables the ordinates corresponding to these Σ distances,
multiply by the percentages that each distribution is of the total
(Table VIII, p. 40, 1910 census, Vol. IV), and divide by Σ.
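
Step 4 can be sketched in a few lines, with the normal density computed directly in place of a lookup in Pearson's tables. The function names are illustrative. The overall scale factor that turns these values into the whole numbers printed in the paper is not stated, so only relative heights are meaningful here.

```python
import math


def normal_ordinate(z):
    """Height of the unit normal curve at z (what Pearson's table supplies)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def group_ordinate(score, median, big_sigma, percent):
    """Ordinate of one occupation group at a given Army Alpha score."""
    z = (score - median) / big_sigma  # distance from the median in units of Sigma
    return percent * normal_ordinate(z) / big_sigma


# Agriculture: median 39, Sigma 34.2, 33.2 per cent of the population.
SCORES = list(range(-40, 50, 10))
AGRICULTURE = [group_ordinate(s, 39, 34.2, 33.2) for s in SCORES]

for score, height in zip(SCORES, AGRICULTURE):
    print(f"{score:4d} {height:.4f}")
```

Summing such ordinates over the nine groups at each score gives the curve for the total population.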











ORDINATES OF DISTRIBUTION OF OCCUPATION GROUPS IN INTELLIGENCE ON SCALE OF ARMY ALPHA

(Score is the score on Army Alpha. A dot marks an empty cell.)

Score   Agric.   Mining   Manuf.  Transp.    Trade   Public    Prof. Domestic Clerical    Total
 -100        .        .        1        .        .        .        .        .        .        1
  -90        1        .        2        .        .        .        .        .        .        3
  -80        2        .        4        .        1        .        .        .        .        7
  -70        6        .        8        1        3        .        .        1        .       19
  -60       15        .       18        2        5        1        .        3        .       44
  -50       33        .       35        4        9        2        .        6        .       89
  -40       67        .       67        9       16        3        .       12        .      174
  -30      126        1      116       18       26        4        .       23        .      314
  -20      217        3      189       32       39        6        1       40        .      527
  -10      349        7      285       54       57        8        2       65        1      828
    0      507       15      395       82       80       11        3       99        3     1195
   10      676       28      517      116      106       14        5      140        6     1608
   20      830       48      629      151      133       17        9      185       12     2014
   30      938       69      711      179      160       20       14      225       21     2337
   40      970       87      747      197      184       23       22      254       34     2518
   50      922       95      732      198      201       25       31      267       51     2522
   60      806       89      667      184      209       25       42      262       71     2355
   70      642       73      565      157      207       25       56      237       92     2054
   80      473       52      445      123      195       24       69      200      110     1691
   90      320       32      325       89      176       21       81      157      122     1323
  100      199       17      225       59      151       18       91      115      126     1001
  110      112        8      143       36      122       15       97       78      120      731
  120       59        3       84       20       95       12       98       49      107      527
  130       28        1       46       10       70        9       95       29       88      376
  140       13        .       24        5       49        6       88       16       67      268
  150        5        .       11        2       33        4       76        8       47      186
  160        2        .        5        1       21        3       63        4       31      130
  170        1        .        2        .       13        2       50        2       19       89
  180        .        .        1        .        8        1       37        1       11       59
  190        .        .        .        .        4        .       27        .        6       37
  200        .        .        .        .        2        .       18        .        3       23
  210        .        .        .        .        1        .       11        .        1       13
  220        .        .        .        .        .        .        7        .        .        7
  230        .        .        .        .        .        .        4        .        .        4
  240        .        .        .        .        .        .        2        .        .        2
  250        .        .        .        .        .        .        1        .        .        1
  260        .        .        .        .        .        .        1        .        .        1

















The mean of this distribution is 50.66, the median 48.41, with σ
43.00. This compares closely with the results of testing in the Army.
On page 764 of the Army Report, I find the median Alpha score of the
white draft, native born, given as 58.9; for the white draft, foreign
born, 46.7; and for the North and South division of the colored draft,
respectively, 38.6 and 12.4. Lump these last two together for a rough
median of the colored draft as 20. Weight these in the ratio 68, 15, and
17, approximately the percentage in the army of native born white,
foreign born white, and negro, and we obtain a composite median of
about 51. The close agreement of this median with my figures is
remarkable. Very nearly the same median on the same scale is found
for the distribution of intelligence of the whole population by two
partially independent methods.
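
The mean and median quoted above can be recovered from the "total population" column of the table of ordinates. The sketch below treats each tabulated score as the midpoint of a class 10 points wide, which is an assumption about the author's procedure, and forms the composite Army median with the 68, 15, 17 weights. The function names are illustrative.

```python
# Total-population ordinates at Army Alpha scores -100, -90, ..., 260.
SCORES = list(range(-100, 270, 10))
TOTALS = [1, 3, 7, 19, 44, 89, 174, 314, 527, 828, 1195, 1608, 2014, 2337,
          2518, 2522, 2355, 2054, 1691, 1323, 1001, 731, 527, 376, 268, 186,
          130, 89, 59, 37, 23, 13, 7, 4, 2, 1, 1]


def grouped_mean(scores, freqs):
    """Mean of a frequency distribution tabulated at class midpoints."""
    return sum(s * f for s, f in zip(scores, freqs)) / sum(freqs)


def grouped_median(scores, freqs, width=10):
    """Median by linear interpolation inside the class that contains it."""
    half = sum(freqs) / 2
    running = 0
    for s, f in zip(scores, freqs):
        if running + f >= half:
            return (s - width / 2) + (half - running) / f * width
        running += f
    raise ValueError("empty distribution")


mean = grouped_mean(SCORES, TOTALS)
median = grouped_median(SCORES, TOTALS)

# Composite Army median: native white, foreign-born white, colored draft.
composite = 0.68 * 58.9 + 0.15 * 46.7 + 0.17 * 20

print(f"mean {mean:.2f}, median {median:.2f}, composite {composite:.1f}")
```

On these assumptions the mean and median come out very close to the 50.66 and 48.41 given in the text, and the composite Army median is about 50.5.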

There has been much speculation as to whether the Army figures 
are truly representative. Terman (’21) criticizes them on the ground 
that the exemptions favored the superior levels. Lincoln (’22) dis- 
putes this, arguing that if anything the exemptions favored the lower 
intelligences and that the Army took superior men. The exemptions, 
amounting to 6,973,000 out of 9,500,000 registrants, are divided as 
follows: 


                                                        Per Cent
Occupational and industrial reasons                        6.8
[illegible]                                                0.5
[illegible]                                               51.1
Already in military or naval service                       8.9
[illegible]                                               13.2
[illegible]                                                7.75


Lincoln argues that exemption because of occupational and indus- 
trial reasons is on the basis of experience rather than intelligence. 
Likewise the Army report seems to think that the volunteers already 
in the military or naval service were above average intelligence. 
Granting that these two groups were above average (15.7 per cent), 
they are more than over-matched by the aliens and the physically, 
mentally, and morally unfit (20.9 per cent) who are evidently below 
the average. And the large group of those having dependents—who 
can say that marriage is positively or negatively correlated with 
intelligence? The fact that the median of my determination of the 
distribution of the intelligence of the population computed on the 
basis of occupations agrees so closely with the Army median is strong 
evidence that the Army group is partial to neither the intelligent nor 
the unintelligent. 

The cumulative curve is strongly skewed to the right, notwithstanding
the fact that I made the assumption that the occupation
groups were normal distributions. In making this assumption, I
eliminated the imperfection in the Alpha scale which is seen in the
piling up of zero and extremely low scores. This skewness depends
on nothing except that the numbers in low-intelligence occupations
so outweigh the numbers in high-intelligence occupations. As a
matter of fact the true curve is probably even more skewed than the
one here pictured. Take, for example, the group "manufacturing and
mechanical industries." In that group laborers have a weight of
25 as against the electricians with a weight of 1, foremen with a weight
of 2, and minor executives with a weight of 1. Again, in the group
"transportation," laborers have a weight of 8 as against foremen with
a weight of 1, conductors with a weight of 1, and telegraph operators
with a weight of 1. In these groups it is evident that, were the
distribution built up by cumulating separate occupations, these, too,
would be skewed to the right, making the total cumulation even more
strongly skewed.

[Figure 1: curves for the occupation groups. Legible legend entries: Professional, Public Service, Transportation, Manufacturing, Mining, Agriculture. Horizontal axis: Army Alpha score.]

Fig. 1. Curve of the distribution of intelligence of the population of the United States
on the scale of Army Alpha.

An inspection of the frequency curves of the different occupation
groups gives the impression that there is a lower limit to the intelligence
at which a man is socially acceptable as a worker, whereas there is no
such clearly defined upper limit. Five of the occupation groups seem
to disappear at about -40. Is it possible that -40 represents a
lower limit of intelligence that is occupationally acceptable or
necessary? The census figures of the feeble-minded in 1910 show that about
42,000 were classed as feeble-minded in special institutions and
almshouses. Granting that the real number in the country was 10 times
42,000, they would make no appreciable difference in the form of the
distribution, occupying an area no larger than the curve representing
those engaged in public service. On the other hand there appears to
be no such limiting upper level to intelligence. The fact that the
professional group exists with a median so much higher than any other
group suggests that the variation in this direction goes just as far as
chance permits.

I hope that these facts will impress all who read them with the 
preponderance of numbers among the low-intelligence occupation 
groups. We are so accustomed to thinking of the whole population in
terms of our own acquaintances and neighbors. I remember being in
a group which was listening to an English lady who had "seen" our
country and had reached the stage where she could give her
"impressions." It was just after the New York City municipal elections and
she was discussing the results. "It is strange," she said, "no one I
have met in this country has any use for Hylan and yet he was elected
by a large majority. Where are the people who voted for him?"
After all, we know only the people most like ourselves and forget that 
there are the others. Workers in educational measurements have 
committed the same errors that this English lady made. We have 
obtained our standards in schools that we find in our districts, in 
schools where we would send our own children, in schools run by men 
who are in sympathy with educational measurements. The result is 
an unconscious but nevertheless real selection. 

It appears that the Stanford Revision of the Binet-Simon is
standardized from measurements of a selected group. Terman ('16) says,
page 52, "The method was to select a school in a community of
average social status, a school attended by all or practically all the
children in the district where it was located." What is a community
of average social status? On page 55, Terman says, "Figure 1 shows
the distribution of mental ages for 62 adults, including the 30 business
men and the 32 high school pupils who were over 16 years of age."
It will be noted that the middle section of the graph represents the
"mental ages" falling between 15 and 17. So Terman bases his average
adult standard on business men and high school students! High
school students will probably average an Army Alpha of 100 to 120,
and the business men probably about the same.










On page 286 in "The Intelligence of School Children" Terman gives
a table of IQ's of various vocational groups. I copy part of his table;
I add to his table the median Army Alpha score of these groups and
the Q1 where they are known.

                              Median           Median Alpha
Vocational group                IQ       Q1       Score        Q1
[illegible]                    109      104
Business men                   102       97
Express employees               95       87
Motormen and conductors         86       79         83         64
Firemen and policemen           84       78         69         46
[illegible]                     85       77         52         38
Hoboes and unemployed           89       71

















On the basis of these figures, if we take the median intelligence of
the whole population to be 48, Army Alpha, it would appear that the
median IQ of the general population would be about 80, say between
80 and 82, as Terman computes adult IQ's. What then is the median
mental age of the general population? 0.82 × 16 = 13.2! And
Lincoln says: "Those results indicate that the Army Mental Age norms
were somewhat high and that the average man may have been slightly
under 13 mentally." Here then is a reconciliation of the divergent
facts that have troubled psychologists these 3 years. There has been
no error. The Stanford revision of the Binet-Simon Scale has been
standardized on selected individuals. It is true that the average
Stanford-Binet Mental Age of the average man is 13.2 years, or about
13 years.

This, then, is an explanation of the discrepancy between the fact
that the average mental age of the average adult is only 13.2 while
mental functions seem to continue to increase until perhaps the age of
18, as found by Brooks and others. The Stanford-Binet is imperfectly
standardized. Ability to do half of the tests in the 14-year group
represents "average adult" ability. Call this a mental age 16 or whatever
you wish. Then the others should be scaled up accordingly. The
results of this study seem to show decisively that the Stanford revision
needs restandardizing. It is not representative as it stands. One
hundred IQ on the Stanford Revision does not indicate the average
man; it represents a man considerably above the average. Probably
an IQ of 100 represents very nearly an Army Alpha score of 100. A
glance at the graph shows how far from being a representative median
this is.¹

There are several discrepancies that may be explained in the 
light of this analysis. Garrison and Tippett (’22) found that the Otis 
Advanced Test gives a higher mental age than the Binet-Simon Test. 
This can well be explained by the selection of groups which were used 
to standardize the tests. Undoubtedly Otis’s norms represent a more 
representative selection of the general population, coming as they do 
from some 11,500 cases. As it was, Garrison and Tippett found that 
the Otis Mental Ages ran from 1 to 2 years higher than the Binet and 
averaged 17.5 months higher. Proctor (’20) gives a distribution of 
Stanford-Binet IQ’s in the high school as follows: 


IQ          NUMBER
125            14
120-124        11
115-119        11
110-114        16
105-109        11
100-104        15
 95- 99        16
 90- 94        11
 85- 89         7
 80- 84         1
              ---
Total         113


The median here is 107 IQ. Compare this with the accompanying
graph showing the distribution of high school pupils on the scale of
intelligence. Proctor has 31 per cent of the pupils in his group in
high school below the average intelligence, that is, below an IQ of 100.
The actual distribution of first year high school pupils on the Alpha
Scale shows that less than 10 per cent are below average intelligence.
To be consistent with the facts which we have noted above, it would be
more natural to assume that Proctor was working with a typical high
school group and that the discrepancy lies in the standardization of
the Stanford Revision.
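
The two figures quoted from Proctor's distribution, the median of 107 and the 31 per cent below average, can be checked from the class frequencies. This is a minimal sketch that assumes class boundaries half a point below each lower limit and treats the top class as open-ended; the names are illustrative.

```python
# Proctor's distribution: (lower limit of the IQ class, number of pupils).
# The top class (125) is treated as open-ended.
CLASSES = [(80, 1), (85, 7), (90, 11), (95, 16), (100, 15),
           (105, 11), (110, 16), (115, 11), (120, 11), (125, 14)]
CLASS_WIDTH = 5

total = sum(count for _, count in CLASSES)
below_100 = sum(count for lower, count in CLASSES if lower < 100)
share_below_100 = 100 * below_100 / total


def interpolated_median(classes, width):
    """Median by linear interpolation within the class that contains it."""
    half = sum(count for _, count in classes) / 2
    running = 0
    for lower, count in classes:
        if running + count >= half:
            return (lower - 0.5) + (half - running) / count * width
        running += count
    raise ValueError("empty distribution")


median_iq = interpolated_median(CLASSES, CLASS_WIDTH)
print(total, round(share_below_100), round(median_iq, 1))
```

The count is 113, about 31 per cent of the pupils fall below an IQ of 100, and the interpolated median lies between 107 and 108.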


1 A corroboration of the above has come to my attention. In the Manual of 
Directions accompanying the Terman Group Test of Mental Ability a table of 
the mental age equivalents is given on page 10. The score on the group test 
corresponding to a mental age of 16 is 135 and on the previous page a score of 
134 is given as the median score of the high school junior. The average man has 
the intelligence of the average high school junior? Preposterous! 













I believe this study gives us considerable help in the problem
of obtaining unselected standardization. Where shall we go for the
average man? The average man has an Army Alpha score of about 48.
Fryer's table shows that representative occupations around this level
are masons, hospital attendants, station agents, miners, teamsters,
riggers, boilermakers, airplane workers, factory storekeepers, horse
shoers, salesclerks, hostlers, barbers, stationary engineers, cobblers,
horse trainers, caterers, bricklayers, auto truck chauffeurs, farmers,
concrete workers, printers, and bakers. To obtain representative
figures, the children and adults tested should come from social groups
of which the occupations listed above are typical, not from high school
students and business men. There are as many in the population who
are less intelligent than the average semi-skilled workmen in the
occupations above as there are of those who are more intelligent.
Every person who wishes to obtain norms or standards representative
of the total population can do no better than to consider carefully
the occupation groups in which he proposes to do his testing. There
seems to be no better ready criterion.

Fig. 2. Curves of the distribution of intelligence of occupation groups on scale of
Army Alpha.

Fig. 3. Curves of the distribution of the intelligence of first year High School
students and first year College students as related to the distribution of intelligence of
the total population on the scale of Army Alpha.

Suppose that the correlation between the intelligence of parents
and children was 0.50, certainly a low estimate. Then the mean of the
intelligence of children of parents of the professional group would be
given by the formula

\[ M_2 = m_2 + r\,\frac{\sigma_2}{\sigma_1}\,(M_1 - m_1) \]

where \(M_2\) is the mean intelligence score of the children of
fathers who are in the professional group.

\(m_2\) is the mean intelligence score of all children.

\(r\) is the correlation between the intelligence of parents and
children.

\(\sigma_2\) is the SD of the intelligence of all children.

\(\sigma_1\) is the SD of the intelligence of all fathers.

\(M_1\) is the mean intelligence of fathers in the professional group.

\(m_1\) is the mean intelligence of all fathers.

We may assume that \(\sigma_1 = \sigma_2\) and \(m_1 = m_2\) since we are hypothetically
measuring on some common scale of intelligence. Then on the scale
of Army Alpha

\[ M_2 = m_2 + r\,(M_1 - m_1) = 51 + 0.50\,(118 - 51) = 84.5 \]

That is, there is always a regression of the children toward the mean of
the whole group whose correlation is given. Undoubtedly the correlation
is much higher than 0.50 for the whole population, but this
hypothetical example illustrates the principle.
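
The regression formula is easy to try with other values of the correlation. The sketch below uses illustrative parameter names and takes the two standard deviations as equal, as the text assumes. The value 43 is the σ of the total distribution found earlier, and it cancels out of the ratio.

```python
def expected_child_mean(mean_children, r, sd_children, sd_fathers,
                        mean_group_fathers, mean_all_fathers):
    """Expected mean of the children of a selected group of fathers."""
    slope = r * sd_children / sd_fathers
    return mean_children + slope * (mean_group_fathers - mean_all_fathers)


# Professional group on the Army Alpha scale: fathers' mean 118, population mean 51.
for r in (0.25, 0.50, 0.75):
    m2 = expected_child_mean(51, r, 43, 43, 118, 51)
    print(f"r = {r:.2f}: expected mean of children = {m2:.2f}")
```

With r = 0.50 this gives 84.5, the figure in the text. A higher correlation moves the children's mean closer to that of their fathers.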

But there is another point to be considered: boys and girls in
the upper grades and high school represent a double selection. They
are not only selected because of their fathers, but they are already
selected because of their own capacities and their own prospective
vocations. Because of this double selection I am inclined to believe
that to restandardize the Stanford Revision it will not be enough
simply to scale the tests down in the ratio of [ratio illegible in the
source]. The amount of selection grows more intense in the upper levels
where we are approaching more nearly the adult selection and away from
the simple parental selection. The regression toward the mean is greater
for children low in the grades than for high school boys and girls who
are beginning to experience a selection on their own account. A true
standardization must take into consideration both the occupation group
from which the children are selected and the amount of regression of the
mean of those offspring toward the mean of the race.


In building up my distribution of the intelligence for the popula- 
tion I have made three assumptions which evidently depart from 
the truth. 

1. I assumed that the separate occupations have normal distributions.
I felt that because of the evident inequality of the units of the
Alpha scale at its lower extremity, this method would not actually
distort the truth more than some more exact method. The operation
in which I used this assumption was in taking $\sigma = 0.74\,(Q_3 - Q_1)$, which
is only true for a normal distribution.

2. I assumed that the composite occupation group was a normal
distribution. Another method that I might have employed was to
build up the complete distribution as a summation of separate occupations.
Here again I believed that because of the inequality of the
units of the Alpha scale at its lower extremity this method would not
distort the truth more than some more exact method. However, as I
showed above, if anything, the occupation groups would also skew to
the right, making the total distribution still more skewed.

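3. I assumed that the separate occupations have the same variability
in applying

$$\sigma^2_{\text{group}} = \sigma^2_{\text{occupation}} + \sigma^2_{\text{medians}}$$

where $\sigma_{\text{group}}$ refers to the distribution of the group, $\sigma_{\text{occupation}}$ to the
average distribution of an occupation, and $\sigma_{\text{medians}}$ to the distribution
of the medians of the various occupations in the group.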


This formula is strictly true only when all the distributions are equal 
in variability, but it is a close enough approximation to take an average 
of the separate distributions when they are of different variability. 
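
Two of the assumptions above lend themselves to a quick numerical check. The sketch below is illustrative only: the occupation scores are invented, the variable names are not from the text, and means with equal-sized groups are used in place of the medians of the original, a case in which the variance identity holds exactly. For a normal curve the interquartile range is about 1.349 SD, which is where a factor of roughly 0.74 comes from.

```python
from statistics import NormalDist, fmean, pvariance

# Assumption 1: for a normal curve, SD = (Q3 - Q1) / (2 * 0.6745).
iqr_to_sd = 1 / (2 * NormalDist().inv_cdf(0.75))
print(round(iqr_to_sd, 2))  # 0.74

# Assumption 3: variance of the whole group = average variance within
# an occupation + variance of the occupation centers.
# Hypothetical scores for three occupations of equal size.
occupations = {
    "laborer": [30, 40, 50, 60],
    "clerk": [60, 70, 80, 90],
    "engineer": [100, 110, 120, 130],
}

all_scores = [s for scores in occupations.values() for s in scores]
total_variance = pvariance(all_scores)

within = fmean(pvariance(scores) for scores in occupations.values())
between = pvariance([fmean(scores) for scores in occupations.values()])

print(abs(total_variance - (within + between)) < 1e-9)  # True
```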







CONCLUSIONS


1. A distribution of the intelligence of the population of the United 
States built up on the basis of occupation-intelligence standards and 
the number in each occupation compares very closely in mean and 
variability with the distribution found in the army. 

2. The distribution is distinctly skewed to the right. 

3. There seems to be a lower limit to the intelligence which is 
occupationally acceptable, but there seems to be no such upper limit. 

4. The Stanford Revision of the Binet-Simon was standardized 
from a superior group. The median IQ of the general adult population
is in the neighborhood of 82 IQ on the Stanford Revision. 

5. The Stanford Revision should be restandardized so that the 
tests on the average adult level (in the 14-year group) may be labelled 
with a mental age which more nearly coincides with the age at which 
mental growth stops. 

6. In the future consider the occupations represented in the groups 
selected for the purpose of standardizing or obtaining norms. 

7. In using occupation groups for standardization bear in mind 
that there is a regression of children toward the mean of the race from 
the mean of the fathers. 

8. This regression is greater for young children than for older 
children where a second selection is taking place looking forward to 
their own occupations based on their capacities. 


REFERENCES MADE IN THE TEXT 


Fryer, D. (’22): Occupational-Intelligence Standards. School and Society, 
Vol. XVI, No. 401, Sept. 2, 1922, pp. 273-277. 

Garrison, S. C. and Tippett, J. S. (’22): Comparison of the Binet-Simon and Otis
Tests. Journal of Educational Research, Vol. VI, No. 1, pp. 42-48. 

Lincoln, E. A. (’22): The Mental Age of Adults. Journal of Educational 
Research, Vol. VI, No. 2, pp. 133-144. 

Proctor, W. M. (’20): Psychological Tests in Educational Guidance. Journal 
of Educational Research, Vol. I, No. 5, pp. 369-381. 

Terman, L. M. (’19): “The Intelligence of School Children.” 

Terman, L. M. (’16): “The Measurement of Intelligence.” 

Terman, L. M. (’22): “Manual of Directions for Terman Group Test of Mental
Ability.”

Terman, L. M. (’21): Mental Growth and the IQ. Journal of Educational
Psychology, Vol. XII, No. 6, pp. 325-341.

Occupation Statistics, “Thirteenth Census of the United States,” 1910, Vol.
IV.

Yerkes, R. M. (’22) ed.: Psychological Examining in the United States Army. 
National Academy of Sciences, Vol. XV. 







THE GRAPHIC RATING SCALE 
MAX FREYD 


University of Pennsylvania 


I 


Owing to the immense importance of ratings in psychological 
experimentation, both pure and applied, constructive effort should 
be directed toward improving the means whereby ratings are obtained. 
For many types of psychological phenomena ratings are the only prac- 
tical equivalents of objective measurements, and this applies especi- 
ally though not exclusively to introspective or verbal report data. 
It is, of course, the aim of the psychological experimenter ultimately 
to be able to present his data in the form of quantitative or qualitative 
measurements objectively arrived at, but where under present condi- 
tions this is impossible, he should seek the least subjective form in 
which his data may be presented. An effective rating scale may fall 
short of that much-desired objectivity, but in skilled hands it will 
provide measures equal in accuracy to those obtained by slipshod 
objective means. Rating scales have a wide use in psychology, and 
their construction demands the same skill and care as the laboratory 
set-up. 

Galton’s study of mental imagery furnishes a splendid example 
of the possibilities of a rating scale in pure psychology.* The data 
of studies in mental imagery such as Galton’s consist of verbal state- 
ments which are interpreted by the experimenter. Differences in 
reports on mental imagery are probably conditioned largely on the 
subject’s originality of expression. This error may be diminished if 
the subject is not required to make a spontaneous report, but indi- 
cates which of a number of descriptive phrases furnishes the most 
accurate description of his own imagery. These phrases can be 
arranged in a scale representing gradually increasing degrees of mental 
imagery. Galton supplied the framework of several such scales, but 
did not make use of them in the manner here indicated. He obtained 
from a large number of individuals verbal accounts of vividness of 
imagery, and ranged representative accounts in a scale from those 
expressing a vivid imagery to those expressing the practical absence of 
imagery. If one is interested in obtaining quantitative measures of 
imagery, one need only attach numerical equivalents to these descrip- 
tive phrases, and allow the subject to indicate the number of the 



phrase which is the most accurate description of his own imagery. A 
fair equivalent of an objective measure is thus obtained. 


GALTON’sS SCALE OF VIVIDNESS OF MENTAL IMAGERY FROM INTRO- 
SPECTIVE REPORTS FURNISHED BY 100 MEN 


Highest. Brilliant, distinct, never blotchy.

First Suboctile. The image once seen is perfectly clear and bright.

First Octile. I can see my breakfast-table or any equally familiar
thing with my mind’s eye quite as well in all particulars as I can do
if the reality is before me.

First Quartile. Fairly clear; illumination of actual scene is fairly
represented. Well defined. Parts do not obtrude themselves, but
attention has to be directed to different points in succession to call
up the whole.

Middlemost. Fairly clear. Brightness probably at least from one-half
to two-thirds of the original. Definition varies very much, one
or two objects being much more distinct than the others, but the
latter come out clearly if attention be paid to them.

Last Quartile. Dim, certainly not comparable to the actual scene.
I have to think separately of the several things on the table to bring
them clearly before the mind’s eye, and when I think of some things
the others fade away in confusion.

Last Octile. Dim and not comparable in brightness to the real
scene. Badly defined with blotches of light; very incomplete; very
little of one object is seen at one time.

Last Suboctile. I am very rarely able to recall any object whatever
with any sort of distinctness. Very occasionally an object or image
will recall itself, but even then it is more like a generalized image than
an individual one. I seem to be almost destitute of visualizing power
as under control.

Lowest. My powers are zero. To my consciousness there is almost
no association of memory with objective visual impressions. I recollect
the table, but do not see it.

In applied psychology, Downey demonstrates how a rating scale 
may be used in scoring a test which does not yield an objective quanti- 
tative measurement.” The test referred to is the Resistance to Oppo- 
sition Test of the Will-Profile Series. In this test the subject is 
required to write his name with his eyes shut. While he is engaged in 
this, the tester places an obstruction, such as a small pasteboard box, 




in front of the pen-point, exerting enough pressure so that considerable 
effort is required to continue writing. In estimating the reaction of 
the subject to this unexpected opposition, Downey makes use of the 
scale of deciles reproduced below. 


DECILE SCALE FOR SCORING REACTION IN RESISTANCE TO OPPOSITION
TEST OF THE WILL-PROFILE SERIES 


10. Strong pressure against obstacle. Writing maintained at initial 
level; firm, strong stroke, usually enlarged characters. No urging. 

9. Very strong counter-pressure on level, but with some sacrifice 
of form; letters blurred or telescoped; trembling or undue speed, or 
other evidence of agitation. No urging. 

8. Very rapid and energetic dodging with or without maintenance 
of form; often, increase in size of letters. Or strong pressure but 
not on level. No urging. 

7. Very deliberate but gentle counter-pressure. Or deliberate 
dodging with mild counter-pressure and little loss of form. Or holding 
examiner’s hand back with the left hand, or protecting one’s own 
hand with left hand. No urging. 

6. Evasive reaction: Reversal of movement; shift of position; 
jumping of obstacle. No urging. 

5. Very mild counter-pressure; loss of form. No urging. 

4. Strong pressure AFTER URGING and READJUSTMENT 
with maintenance of form. 

3. Moderate pressure after urging with maintenance of form. Or 
deliberate dodging after urging. 

2. Moderate counter-pressure after urging with some attempt to 
preserve form. 

1. Feeble pressure after urging with loss of form. 

0. Absolute passivity in spite of urging. Typical remarks: “I
can’t.” “How can I when you stop me?”

Plant describes a rating scale which is to be used in both pure and 
applied experimentation.‘ This rating scheme for conduct has for its 
practical purpose the betterment of nurses’ notes in psychiatric hos- 
pitals—giving the nurse an idea of what is wanted, and causing her to 
observe and state facts rather than draw general conclusions. The 
scale will also apparently be used in putting to experimental proof 
“Helmholtz’s assertion that the law of conservation of energy does not
apply to mental phenomena.” Ratings are made with this scale on 



































16 partitions or aspects of personality, each partition being measured 
by a scale of approximately 10 phrases somewhat like the one quoted 
from Downey. Plant obtains the mean standing in these 16 partitions 
at any one complete rating, and traces the fluctuations of this mean 
over a period of time. The standard deviation is also traced. Plant 
considers the latter a valuable diagnostic sign, its size being an index 


of the patient’s mental confusion. The scale for grading attention 
is given below. 


PLANT’s SCALE FOR RATING ATTENTION 


One of a Number of Such Scales for the Use of Nurses in Psychiatric 
Hospitals 

1. Stuporous. 

2. Can’t hold attention long enough to do even commonest 
things such as completely dressing self or eating a meal. 

3. Dresses self but can’t hold attention long enough to do any 
particular work. 

4. Can do only childish pieces of work. Cannot fit a picture 
puzzle of more than 15 or 20 pieces. 

5. Can do only childish pieces of work if they are new. Will do 
very long and complicated pieces of work along lines he has been 
working on—as picture puzzles. 

6. Can sew for half an hour or so. With the men, those who can
play a game of checkers or billiards but do nothing requiring a longer
time. Leaves task half finished to take up some other task.

7. Remains interested in a piece of work until the end of the day, 
but next morning has forgotten it or has no interest in it. 

8. Will work for a day, or day and a half, on a piece of work, and 
finish it. 

9. Often stops, even for days, in a task requiring a long time but 
goes back to it over and over again until it is finished. 

10. Plans and carries out a piece of work requiring a long period 
of time, as weaving a rug or making a piece of pottery. 
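
Plant's summary of one complete rating (the mean standing over the 16 partitions, with the standard deviation traced alongside it) can be sketched as follows. The ratings are invented for illustration and the variable names are not Plant's.

```python
from statistics import fmean, pstdev

# Hypothetical ratings (1 to 10) on 16 partitions at two complete ratings.
ratings_by_date = {
    "week 1": [3, 4, 2, 5, 3, 4, 6, 2, 3, 5, 4, 3, 2, 4, 5, 3],
    "week 2": [5, 5, 4, 6, 5, 5, 6, 4, 5, 6, 5, 5, 4, 5, 6, 5],
}

# For each date keep the mean standing and the spread across partitions.
profile = {
    date: (fmean(scores), pstdev(scores))
    for date, scores in ratings_by_date.items()
}

for date, (mean, sd) in profile.items():
    print(f"{date}: mean={mean:.2f} sd={sd:.2f}")
```

In this invented case the mean rises and the spread narrows from the first week to the second, which on Plant's reading would suggest improvement accompanied by less confusion.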

In these scales Plant attempts to recapitulate ontogeny in the order
of the sub-divisions. He says regarding this rating scheme: “On its
face we are dealing with a facultative psychology, long since discarded.
That is, however, not the case since the terms are social and not
psychological.” In other words, the measurements are of objective
rather than subjective phenomena, in terms of behavior rather than in
terms of introspection, but still psychological. The facultative objection,
however, could be precluded if the 10 divisions represented steps
on a continuous line.

Many similar examples could be cited from the field of educational 
measurements, such as the various handwriting scales. In these cases, 
however, the steps on the scale are based on the decisions of competent 
judges. By invoking the aid of these judges the experimenter is 
enabled to add to the accuracy of the steps on his scale. 

The above illustrate the possibilities of the use of rating schemes in 
pure and applied psychology, and indicate how they render measure- 
ment possible with seemingly incoherent data. 


When an objective method of measurement attempts to supplant


a rating method, as in the case of psychological tests, it is easy to lose 
sight of the fact that the objective measurement is no more reliable 
than the ratings which it displaces. It is important, therefore, when 
building up objective methods of measurement to take the place of 
ratings, to obtain first of all the most accurate and reliable ratings 
possible, by refining the methods whereby ratings are obtained. 

The recent development of the graphic rating method warrants 
this discussion, in order that the scale may receive wider use, and its 
merits and demerits be put to stricter tests.


II 


There are innumerable possibilities in the way of methods of 
rating. The list which follows gives some notion of the diversity of
means at hand. 

Let us assume that we wish to measure a group of men with regard 
to some phase of personality, as self-consciousness. We may obtain 
expressions of the degree of self-consciousness in the members of the 
group by instructing judges to rate them by any of the following 
methods. 

1. Have each subject rated as self-conscious, or not self-conscious. 

2. Have the subjects arranged in order from those displaying the 
greatest self-consciousness to those displaying the least self-conscious- 
ness, and have them rated from 1 to n. 

3. Assuming 100 per cent to represent the greatest degree of self- 
consciousness possible in any one person, we may have each individual 
in the group rated in terms of the degree of self-consciousness which 
he possesses. 




4. Have the judges select the men who are outstandingly self- 
conscious, and the men who are outstandingly not self-conscious. 
They may place as many men as they choose in these two classes, or 
they may be required to place a certain number of the whole group 
in the two classes. These men may be rated as self-conscious or not 
self-conscious; or + or —. The rest of the group may be rated as 
average or neutral, or be assigned the symbol ? 

5. Have the judges place the men in several groups (3, 5, or 10
are often employed), according to the amount of self-consciousness
which they display. Assign terms to these various groups, as high,
average, and low; or good, fair, and poor; or 1, 2, 3, . . . ; a, b, c,
. . . , etc.

6. The judges may indicate their opinion of the amount of each 
man’s self-consciousness by checking one of a number of symbols, as 
follows: +!  +  +?  -?  -  -!;   +!  +  ?  -  -!;   Y!  Y  ?  N  N!;
Y  y  ?  n  N; Y representing yes and N representing no in answer to
the question: “Is this man self-conscious?”

7. Have each of the judges select 5 members of the group, one 
being extremely self-conscious, another not being self-conscious and 
the remaining 3 representing intermediate degrees of self-conscious- 
ness. These men should be given ratings of 10, 8, 6, 4, or 2, according 
to the amount of their self-consciousness. The judges may then pro- 
ceed to rate the remaining members of the group by assigning each of 
them a number from 2 to 10 in comparison with these 5 representative 
men. This is the Army Rating Scale method.

8. Draw a straight line to represent the range of self-conscious- 
ness, and have the judges indicate each man’s self-consciousness by 
making a cross along this line. 

9. A large number of phrases descriptive of varying degrees of 
self-consciousness may be collected, and arranged in order as in the 
examples cited from Plant and Downey. These phrases may be 
numbered from 1 to 5, or from 1 to 10, depending on the number 
of phrases. The judges may indicate their rating by checking the 
phrase which corresponds most closely to their estimate of the man’s 
self-consciousness. 

10. The graphic rating method is a combination of the two 
preceding methods. In this case the rating is indicated by a 
check along a straight line, under which are printed descriptive 


phrases indicative of varying degrees of the trait, from one extreme 
to the other.




11. A method which may be employed where ratings on men are 
desired in a series of traits, is to pair each of these traits with each of 
the others in the series, and to ask the rater which trait in each pair 
is the more pronounced in the person rated. The rater may thus
consider the subject more “self-conscious” than “intelligent,” if these
traits are among those enumerated. As all traits will appear the same
number of times, the degree to which a certain trait is dominant in 
the personality of an individual may be deduced from the frequency 
with which that trait is chosen when paired with other traits. The 
traits may be ranked for each person in the order of their presence in 
him. This hardly constitutes a fair basis for comparing one person 
with another in regard to any particular trait, as the traits may retain 
the same relative value in two individuals yet all be present to a greater 
degree in one person than in the other. 
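
The tally described in method 11 can be sketched as follows. The list of traits and the rater's choices are hypothetical, and the names used are not from the text.

```python
from collections import Counter
from itertools import combinations

traits = ["self-conscious", "intelligent", "neat", "talkative"]

# Hypothetical rater choices: for each pair, the trait judged more pronounced.
choices = {
    ("self-conscious", "intelligent"): "self-conscious",
    ("self-conscious", "neat"): "self-conscious",
    ("self-conscious", "talkative"): "self-conscious",
    ("intelligent", "neat"): "intelligent",
    ("intelligent", "talkative"): "intelligent",
    ("neat", "talkative"): "talkative",
}

# Every trait is paired once with each of the others.
assert set(choices) == set(combinations(traits, 2))

# Count how often each trait was chosen, starting every trait at zero.
wins = Counter({trait: 0 for trait in traits})
wins.update(choices.values())

# Rank the traits within this one person by frequency of choice.
ranking = sorted(traits, key=lambda t: wins[t], reverse=True)
print(ranking)
```

The ranking orders the traits within one person only. Two persons may share the same order while differing in the absolute degree of every trait, which is the limitation noted above.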


III


When confronted with all these possibilities in the way of rating 
methods, we immediately ask ourselves, which is the most effective and 
desirable? If some means were at hand for determining without 
equivocation and by a method not based on ratings, the exact degree of 
a certain trait in all individuals, we should be able to compare the accu- 
racy of ratings made by the various scales. But if such a method of 
objectively measuring the trait were available, we should have no need 
for the rating scale. Furthermore, such a trait may differ so funda- 
mentally from those traits which are not capable of objective measure- 
ment (in the sense that the rater would have more cues) as to limit the 
broadness of the conclusions to be drawn. 


Ratings are ultimate things, and the comparison of the various
rating systems cannot be made by recourse to an external criterion.
In the writer’s opinion there are no flawless methods of evaluating
rating scales. The criteria which have been advanced may be divided
roughly into those which appeal to such factors as ease of administration
and scoring, popularity, and so forth, and those which employ
statistical reasoning.

The non-statistical criteria include such items as the ease with 
which one may grasp the directions for making the rating, the time 
required for completing the rating, the agreeableness of the rating 
task, the simplicity of the scale, the universality of the scale, the ease 
with which the rating may be scored, and so forth. These are important
criteria, unless one has access to trained judges with unlimited
patience.

There are approximately seven statistical criteria. One criterion, 
used in the case of ratings on intelligence, is the comparison of rat- 
ings with intelligence test scores. Rugg'® used this criterion (among 
others) with the Army Rating Scale, and found negative results. This 
may of course be construed as an argument against the narrow con- 
ception of general intelligence admitted in the employment of an intel- 
lectual intelligence test, as well as against the rating method. One 
conclusion to be drawn from such results is that more attention is 
needed to the determination of the exact abilities measured by tests. 

The following from Hayes and Paterson includes two other criteria 
of the reliability of a rating scale: “The graphic rating method was
found to be highly reliable, as shown by the close relationship between


ratings on the same men by the same judge for different months, and 
by a close relationship between ratings on the same men by different 
judges.” (Rugg also uses the latter criterion in showing the weakness 
of the Army Rating Scale.) Certain cautions are called for when such 
criteria are employed. Should a judge’s ratings vary or remain con- 
stant from month to month? Under certain conditions, is it not to 
be expected that a judge’s estimate must change from month to month
and may not a lack of such change indicate a weakness in the scale? 
If a judge rates a person the same on successive occasions it may be 
an indication that he has learned nothing new about the subject’s 
personality in the interval; that he did, and wished to avoid the detec- 
tion of his initial misjudgment; or that he did, and the scale afforded 
him no means of altering his judgment. In the opinion of the writer, 
agreement between judges is a more valid criterion, because if this 


agreement exists it is an indication that the scale calls attention to


universally noted characteristics and makes them the basis for the rat- 
ing; provided the judges have had equality of opportunity to judge the 
subject, and no judges are included toward whom the subject would 
display a peculiar attitude. The data from which Hayes and Paterson 
draw their conclusions consist of ratings by foremen on their workmen. 
We should expect these judges to agree in their ratings of any of their 
subordinates, but a different story might be told if these subordinates 
were rated by their wives and their fellow-workers. We should expect 
agreement among raters who bear the same social and industrial rela- 
tionship to the subjects, and this agreement may be considered a valid 
criterion of the worth of a rating scale. 
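
The two reliability criteria just discussed, agreement of one judge with himself over time and agreement between judges, both reduce to a correlation between two sets of ratings on the same men. A minimal sketch follows; the ratings are invented, and the function and variable names are illustrative rather than taken from Hayes and Paterson.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of the same six workmen.
foreman_a_january = [12, 15, 9, 18, 7, 14]
foreman_a_february = [13, 15, 10, 17, 8, 13]
foreman_b_january = [11, 16, 8, 17, 9, 15]

same_judge_over_time = pearson_r(foreman_a_january, foreman_a_february)
between_judges = pearson_r(foreman_a_january, foreman_b_january)
print(round(same_judge_over_time, 2), round(between_judges, 2))
```

A high value of the first coefficient is open to the objections raised above, since it cannot distinguish a stable subject from a judge who has learned nothing new or a scale that allows no revision.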





















Fourth and fifth criteria relate to the form of the distribution of
ratings: its normality and its spread. As to the normality of the distribution,
we have no notion of the true distribution of the trait,
nor could we assume that because the distribution of ratings corresponded
to the true distribution of the trait, the ratings were correct.
We may generalize in the statement that with a large number of cases
a spotty distribution indicates that certain portions of the scale are
being neglected by raters, or that the steps on the scale are not of equal
value. This matter of equalizing the steps on the rating scale involves
considerable labor, and is usually omitted except for very refined work.
If the scores on the rating scales are transmuted into ranks, nothing
is lost by inequalities of steps on the rating scale. Spread of distribution
is an important factor, since we must have sufficient discrimination
between abilities in order to figure correlation coefficients and to
distinguish between one man’s ability and another’s. Too great a spread,
however, adds to the error of any single rating.

Thorndike” and Rugg’ call attention to a constant error in ratings, 
namely, the tendency for the judge to be influenced in his ratings on 

the specific traits of an individual by a general attitude or set toward
that individual. If the judge likes the person, he rates him high in 
everything; if he dislikes him, he rates him low in everything. The 
error due to the formation of a “halo” about a person is evidenced 
when high correlations are found between ratings on unrelated traits. 
Thorndike reports finding correlations of +0.58 between intelligence 
and leadership, +0.51 between intelligence and physique, and +0.64
between intelligence and character, when one judge rated 137 aviation 
cadets on the Army Scale. He also reports an average correlation of 
+0.67 between general ability for officer work and so highly specialized 
a quality as flying ability, when the same men were rated by eight 
judges. The average rater will not rid himself of this bias and exer- 
cise his analytical powers, unless the scale itself aids him to do so, and 
the absence of halo may be considered a sixth criterion of the efficiency 
of a rating scale. With the same judges rating the same subjects, that
scale which shows the highest correlation between obviously unrelated
traits may be considered the weakest. 
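
The halo criterion can be sketched as the average correlation between ratings on traits that ought to be unrelated, computed for each scale from the same judge and the same subjects. The data and all names below are hypothetical; the comparison rule is the one stated above, the scale with the higher average being the weaker.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def mean_intertrait_r(ratings):
    """Average correlation over all pairs of traits (ratings: trait -> scores)."""
    pairs = list(combinations(ratings, 2))
    return sum(pearson_r(ratings[a], ratings[b]) for a, b in pairs) / len(pairs)


# Hypothetical ratings of the same five men by the same judge on two scales.
scale_a = {
    "intelligence": [9, 7, 5, 3, 1],
    "physique": [8, 7, 5, 4, 2],
    "neatness": [9, 6, 5, 3, 2],
}
scale_b = {
    "intelligence": [9, 7, 5, 3, 1],
    "physique": [4, 8, 2, 7, 5],
    "neatness": [6, 5, 3, 7, 4],
}

halo_a = mean_intertrait_r(scale_a)
halo_b = mean_intertrait_r(scale_b)
print(round(halo_a, 2), round(halo_b, 2))
```

Here the first scale, on which every trait rises and falls together, would be judged the weaker of the two.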

A seventh criterion, one mentioned by Plant, is as follows: Present
a person with a list of his acquaintances, and with a set of ratings on
one of the men in the list. Ask him to indicate which of the men is
the one to whom the ratings apply.




IV 


The graphic rating method, which was originated in the Scott Co. 
Laboratory in 1920, is the latest development in rating methods and 
promises to be the most popular. Its only original feature is the com- 
bination of the methods of rating on a line and by checking descriptive 
terms, both of which were in prior existence. Several graphic rating 
scales are now being used by the Scott Company and by the Bureau of 
Personnel Research at Carnegie Institute of Technology. Among 
Terman’s materials for the study of gifted children may be found some 
modifications of the graphic rating method.’® 

The following directions and illustrative items are taken from a
scale developed by the writer at Carnegie Institute of Technology.¹


GRAPHIC RATING OF ____________________


Instructions for Using the Rating Scale 


1. Let these ratings represent your own judgments. Please do not consult anyone in making
them.

2. In rating this person on a particular trait, disregard every other trait but that one. Many
ratings are rendered valueless because the rater allows himself to be influenced by a general
favorable or unfavorable impression which he has formed of the person.

3. When you have satisfied yourself on the standing of this person in the trait on which you are
rating him, place a check at the appropriate point on the horizontal line. You do not have to
place your check directly above a descriptive phrase. You may place your check at any point
on the line.


3. Does he appear neat or slovenly in his dress?

|----------------------------------------------------------------------|

Extremely neat   Appropriately and   Inconspicuous   Somewhat careless   Very slovenly
and clean. Almost   neatly dressed   in dress   in his dress   and unkempt
a dude.


9. How does he impress people by his physique and bearing?

|----------------------------------------------------------------------|

Looked down on   Unimpressive physique   Noticeable for good   Excites admiration.
and bearing   physique and bearing   Very impressive


13. How flexible is he?

|----------------------------------------------------------------------|

Hidebound.   Slow to take up   Progressive   Quick to pick   Is always adapting
Runs in a rut   new ideas   tendencies   up new ways   himself and taking
and habits   up new ideas


18. Is he quiet or talkative?

|----------------------------------------------------------------------|

Talks seldom.   Does not uphold his   Moderately   More than upholds   Great talker.
When questioned   end of the con-   talkative   his end of the   Always going
answers briefly   versation   conversation


¹ This scale will be given in full in a forthcoming monograph by the writer.*










This Rating Scale afforded ratings on 20 such traits. In 10 of the 
20 traits the “good” end of the scale was at the left. The 20 ratings
on any individual may be made in about 10 minutes. 

To score the ratings a stencil was used, one of whose edges was 
marked off for a distance equal to the length of the graphic rating line. 
This space was divided off into 20 consecutively numbered spaces of 
equal length. The stencil was placed beneath the line on which the 
rating was made, so that this line coincided with the marked off space 
on the stencil. The score was the number of the space over which the 
check was made. An X-shaped check was scored at the intersection of
the two lines; a V-shaped check at the point of the V. If more than one
check was made on the same scale for the same subject, indicating 
doubt on the part of the rater, the average of the ratings was taken. 
No total score was obtained, as the scale was not used in any study 
requiring this step. 
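
The stencil procedure amounts to dividing the line into 20 equal spaces and reading off the number of the space in which the check falls. The sketch below assumes that positions are measured from the left end of the line and that the spaces are numbered from left to right; the function names and the 100 mm line are illustrative, not taken from the scale itself.

```python
import math


def stencil_score(check_position, line_length, spaces=20):
    """Number (1..spaces) of the equal-length space in which a check falls.

    check_position is the distance of the check from the left end of the
    line, in the same unit as line_length.
    """
    if not 0 <= check_position <= line_length:
        raise ValueError("check lies off the rating line")
    space = math.floor(check_position * spaces / line_length) + 1
    return min(space, spaces)  # a check at the very end belongs to the last space


def score_rating(check_positions, line_length, spaces=20):
    """Average the scores when a rater made more than one check on a line."""
    scores = [stencil_score(p, line_length, spaces) for p in check_positions]
    return sum(scores) / len(scores)


# Hypothetical 100 mm line: one check at 37 mm, then two checks at 37 and 52 mm.
print(score_rating([37], 100))      # 8.0
print(score_rating([37, 52], 100))  # 9.5
```

Changing `spaces` from 20 to 10 gives a coarser stencil with scores running from 1 to 10.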

The scales in use by the Scott Co. differ from this one in several 
respects. In their scales the “good” end is always at the left; scores
range from 1 to 10 instead of 20; and a total score is obtained on a 
number of traits. In order to correct for the tendency for some raters 
to rate too high and others too low, a distribution table of total ratings 
is made for each rater. The ratings of different judges are equalized 
by assigning the same numerical final rating to the upper 10 per cent of 
each judge’s distribution, the next 20 per cent, the middle 40 per cent, 
the next 20 per cent, and the lowest 10 per cent. Each subject’s final 
rating, then, represents not his raw score in the scale, but his standing 
in terms of each judge’s standards of ratings. 
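
The equalizing step can be sketched as below. The band labels 5 to 1 are an assumption, since the text does not give the numerical values assigned; the handling of ties (by order of listing) and the names used are likewise illustrative.

```python
def equalize(total_scores):
    """Place one judge's total ratings in five bands of his own distribution:
    top 10%, next 20%, middle 40%, next 20%, lowest 10%.

    total_scores maps subject -> total rating. Returns subject -> band,
    with 5 for the top band down to 1 for the lowest (labels assumed).
    """
    ordered = sorted(total_scores, key=total_scores.get, reverse=True)
    n = len(ordered)
    cutoffs = [(0.10, 5), (0.30, 4), (0.70, 3), (0.90, 2), (1.00, 1)]
    bands = {}
    for position, subject in enumerate(ordered):
        fraction_above = (position + 0.5) / n  # midpoint of this subject's slot
        bands[subject] = next(band for limit, band in cutoffs if fraction_above <= limit)
    return bands


# Hypothetical judges, one lenient and one severe, rating the same ten men.
lenient = {f"man{i}": 90 - 3 * i for i in range(10)}
severe = {f"man{i}": 50 - 3 * i for i in range(10)}
print(equalize(lenient) == equalize(severe))  # True
```

Although the severe judge's raw totals run 40 points lower, each man receives the same final rating from both judges, because only his standing within each judge's own distribution is retained.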


V 


What advantages does this method of rating have over other methods? 

The two basic features of the graphic rating method, according to 
Hayes and Paterson, are the following: 

“1. The rater is freed from direct quantitative terms in judging 
men.” 

“2. The rater can make as fine a discrimination of merit as he 
chooses.” 

According to these writers, the scale is “simple, self-explanatory,
concrete and definite.”

Scott asserts that people like to use this sort of scale.’ Any scale 
which makes the rating task interesting is of advantage when rating 







































94 The Journal of Educational Psychology 


scales are to be returned by mail, or when there is no good incentive for 
their use by the judges. 

The graphic rating method has the following general advantages: 

It is simple and easily grasped. 

It is interesting and requires little motivation of the rater. 

It is quickly filled out. 

It is simply and easily scored. 

It frees the rater from direct quantitative terms. 

It enables the rater, nevertheless, to make his discrimination as fine 
as he cares, although this discrimination is lost if a scoring stencil of 
only a few points is used. 

The descriptive terms aid the rater in that they make the various 
degrees of the trait more concrete. Many scales call for ratings on such
qualities as “neatness,” giving merely a definition of this word.

It is universal; that is, no master scale is required as in the Army 
Scale. When a group is rated by several judges, corrections may be 
made for varying standards of the judges by the Scott Co. method. 

The fineness of the scoring method may be altered at will, yielding 
scores of from 1 to 5, or from 1 to 100. 

It allows of comparable ratings without requiring each rater 
to know all the members of the group. 

The scale yields a “close relationship between ratings on the same 
men by the same judge for different months.”® The data for this con- 
clusion are presented in one of the Scott Co. bulletins, and consist of 
three sets of ratings on a number of workmen by nine foremen, at 
intervals of a month. The average correlation of ratings for the first 
month with those for the second month is +0.76 (the lowest is +0.52). 
The average correlation for the second and third months is +0.87 (the 
lowest is +0.66). 

The scale yields a ‘‘close relationship between ratings on the same 
men by different judges.”*® The Scott Co. bulletins give the correla- 
tions obtained between the ratings of pairs of foremen on their workers. 
The average correlation between seven pairs of foremen is +0.71. 
These were the first of a series of monthly ratings of their workers. 

The correlation coefficients given in the above two paragraphs were 
obtained with total ratings on seven traits. No indication is given by 
the authors of the degree of relationship between ratings on specific 
traits. 

The form of the distributions of ratings will vary with the construc- 
tion of the scale, as well as with the true distribution of the trait 






















among the persons rated. With poorly constructed scales the checks
are for the most part made directly above the descriptive phrases.
Figure 1 shows the distribution of 100 self-ratings by college students
on trait 2 of the writer’s scale. (The manner in which these figures











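[Figure 1. Histogram of the number of ratings falling at each point of the rating line. Legible phrases beneath the line, from left to right: “Very good-natured,” “Agreeable,” “Rather glum and ...,” “Grouchy and ...”]

Fig. 1.—Distribution of self-ratings by one hundred College students in Trait 2 of the
Graphic Rating Scale.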


were obtained will be described later.) This scale is defective in that 
the central phrase does not describe the average amount of the trait. 
In Fig. 2 the distribution shows greater spread, but is spotty due to the 














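[Figure 2. Histogram of the number of ratings falling at each point of the rating line. Legible phrases beneath the line, from left to right: “Very excitable and ...,” “Easily stirred,” “Usually cool and ...”]

Fig. 2.—Distribution of self-ratings by one hundred College students in Trait 4 of the
Graphic Rating Scale.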


fact that ratings were made for the most part directly above phrases. 
In Fig. 3 the tendency to check above phrases is diminished and there is 
a greater spread in the distribution. These ratings were all made under
the same circumstances and with the use of the same directions. 




The difference between the distribution in Fig. 2 and that in Fig. 3 can 
only be explained by the assumption that the scale for trait 15 is a 
superior scale. The Scott Co. bulletins report symmetrical distribu- 
tions of ratings, with the use of 10 scores instead of 20. 

With regard to the elimination of halo, little can be said that is 
definite. Theoretically, we should expect a diminution of the halo 
with the use of the scale herein described, since the directions clearly 
explain this tendency and warn against its presence; since both 
extremes of a scale may represent undesirable qualities; and since the 
items are alternated so that a motor tendency to check at one side of 
the sheet is seemingly eliminated. The traits, furthermore, are not 
described only in terms of their desirable extreme. 








[Figure 3. Histogram of the number of ratings falling at each point of the rating line. Phrases beneath the line, from left to right: “Often confesses his thoughts and feelings to friends and acquaintances,” “At times unburdens spontaneously to friends,” “Will occasionally unburden when questioned,” “Never unburdens. Rarely talks about himself.”]

Fig. 3.—Distribution of self-ratings by one hundred College students in Trait 15 of the
Graphic Rating Scale.


Some data which the writer has collected may throw a little light 
on the interrelationship of abilities as rated on the graphic rating 
scale, although the manner in which the data have been developed 
admittedly leaves some doubt as to the finality of the conclusions to be 
drawn from them. They are not directly comparable with the data 
presented by Thorndike to show the presence of halo. 

Each of 100 college students was asked to rate himself on the scale 
presented in this article, by considering himself from the standpoint of 
an impartial observer. (These are the men some of whose self-ratings
are shown in Figs. 1, 2, and 3.) Similar scales were then sent to five
of the acquaintances of each of these men, but with rare exceptions,
no two men were rated by the same judge. All self-ratings and ratings
of others were made anonymously. All individuals were excluded from
further study on whom less than two such ratings by acquaintances were
available. There remained 84 students all of whom had rated them- 
selves, and who had been rated on the same scale by from two to five 
acquaintances. The ratings of acquaintances for each of the students 
were then averaged. This average rating on each trait for each student 













was then averaged with the student’s own rating on each trait. The 
result was a series of 20 measures of each student, in which his self- 
rating and the average rating of his acquaintances were given equal 
weight. These final average ratings were intercorrelated, resulting 
in 190 correlation coefficients (Table I). 
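The way the final measures were built up, and the count of 190 coefficients, can be checked with a short sketch. The function name and the representation of a rating sheet as a list of trait scores are assumptions made for illustration only.

```python
from itertools import combinations
from statistics import mean


def final_ratings(self_rating, acquaintance_ratings):
    """Average the acquaintances trait by trait, then average that with the self-rating.

    self_rating: list of scores, one per trait.
    acquaintance_ratings: list of such lists, one per acquaintance (two to five).
    """
    n_traits = len(self_rating)
    others = [mean(sheet[t] for sheet in acquaintance_ratings) for t in range(n_traits)]
    # Self-rating and average acquaintance rating receive equal weight.
    return [(own + avg) / 2 for own, avg in zip(self_rating, others)]


# Every unordered pair of the 20 traits yields one coefficient.
trait_pairs = list(combinations(range(1, 21), 2))
```

With 20 traits there are 20 × 19 / 2 = 190 distinct pairs, which agrees with the number of coefficients reported.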

These intercorrelations were all computed by the fourfold table 
method (method of unlike-signed pairs). Furthermore, in obtaining 
these correlation coefficients, one extreme of the scale was considered 
the “good” end by an arbitrary decision. The ‘‘good” name was 
given to the trait, and a rating at the good end was considered high, 
and at the bad end, low. This necessitated in half the cases reversing 
the scores obtained by the scoring stencil, since the values on this 
stencil ranged in size from left to right only. The effect is as if the good 
end of the scales were always at the right margin instead of being 
alternated, and as if the trait were described by the extreme at the right 
margin. 
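The article names the method of unlike-signed pairs without giving its formula. A minimal sketch follows, assuming the usual cosine form, r = cos(πU / (L + U)), where U and L are the numbers of unlike-signed and like-signed pairs of deviations. Taking deviations from the median and dropping pairs with a zero deviation are further assumptions, as are the function names.

```python
import math
from statistics import median


def reverse_score(score, n_spaces=20):
    """Flip a stencil score so that the arbitrarily chosen "good" end counts as high."""
    return n_spaces + 1 - score


def unlike_signs_r(x, y):
    """Estimate a correlation from the share of unlike-signed deviation pairs."""
    mx, my = median(x), median(y)
    like = unlike = 0
    for a, b in zip(x, y):
        product = (a - mx) * (b - my)
        if product > 0:
            like += 1
        elif product < 0:
            unlike += 1
        # Pairs with a zero deviation are dropped (an assumption).
    return math.cos(math.pi * unlike / (like + unlike))
```

When every pair is like-signed the estimate is +1, when every pair is unlike-signed it is −1, and when the two kinds are equally frequent it is zero.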

If the tendency to form a halo about a person were common to 
all judges, we should expect the judges to rate people on the same 
dead level in all traits. A person would thus be rated consistently 
high, consistently low, or consistently mediocre, and this situation 
would not be modified if averages of various ratings on a person were 
used instead of the ratings of only one judge: The averages would 
nevertheless be consistent. The result would be an exaggerated inter- 
correlation of various traits as measured by ratings, provided the 
favorable extreme of the trait were always considered high. This is 
the phenomenon to which Thorndike has referred. Our procedure is 
postulated on the phenomenon of halo being as probable in self- 
ratings as in the ratings of others. 

The intercorrelations thus obtained are given in Table I, and 
Table II shows the frequency distribution of these correlations. High 
coefficients are the exception. The highest is between flexibility 
(adaptability) and quickness in work (+0.66), perhaps higher than
we should expect. The other positive correlations of 0.51 or more 
are between quickness in work and present-mindedness, good-nature 
and even-temper, cool-headedness and cautiousness, freedom from self- 
consciousness and good-bearing, freedom from self-consciousness and 
sociability with the other sex, sociability and sociability with the other 
sex, sociability and talkativeness, open-heartedness and talkativeness, 
sociability with the other sex and talkativeness. There is a correla- 
tion of —0.54 between appreciation of others and talkativeness. 





































































TABLE I.—INTERCORRELATIONS OF AVERAGE RATINGS (84 CASES)
Decimal Points are Omitted

[The table is printed sideways in the original and its coefficients cannot be recovered from this copy. The traits, so far as their names are legible, are numbered as follows: 1. Present-mindedness; 2. Good-nature; 3. Neatness in dress; 4. Cool-headedness; 5. Self-assertion; 6. Accuracy in work; 7. Freedom from self-consciousness; 8. Cautiousness; 9. Good bearing; 10. (illegible); 11. Even-temper; 12. Appreciation of others; 13. Flexibility; 14. Sociability; 15. Open-heartedness; 16. Sociability with other sex; 17. (illegible); 18. Talkativeness; 19. (illegible); 20. Quickness in work.]
















With one or two possible exceptions, these results do not exceed 
expectation. 

Insofar as these figures are concerned, it may be safely ventured 
that the graphic rating scale tends to eliminate halo. 


TABLE II.—DISTRIBUTION OF CORRELATION COEFFICIENTS GIVEN IN TABLE I

AMOUNT OF CORRELATION COEFFICIENT        FREQUENCY

−.51 to −.60
−.41 to −.50
−.31 to −.40
−.21 to −.30
−.11 to −.20
−.01 to −.10
 .00
+.01 to +.10
+.11 to +.20
+.21 to +.30
+.31 to +.40
+.41 to +.50
+.51 to +.60
+.61 to +.70

[The entries of the frequency column are not legible in this copy.]

VI 


There are certain rules, based on experience, which it is well to 
follow in constructing a graphic rating scale. The following is a fairly 
complete list of points to be considered in making a scale. 

Define the trait on which you wish ratings. It will often be found 
that what one considered a single trait is in reality composed of a 
number of well-defined separate traits. Define the trait in terms of 
what the individual actually does, for the more concretely the trait is 
expressed, the greater is the expectation that raters will agree. 

Decide on the extremes of the trait. It is frequently the case that 
one extreme of a scale may have several opposites. 

It will be found good practice to introduce every scale with a 
question, to which the rating furnishes the answer; for instance, the 
question “How tactful is he?” or “Is he tactful or tactless?” may be
answered by checking on the rating line. 

The rating line should be of such a length that a stencil for scoring 
the rating can easily be calibrated. 

There should be no breaks or divisions in the line. 

The line should not be much more than five inches in length, so 
that it may be grasped as a unit. 




There should not be more than five descriptive items nor less than 
three. 


The end phrases of the scale should not be so extremely worded 
as never to be employed. 

The phrase descriptive of the neutral or average degree of the trait 
should be in the center of the scale. 

If there are five items, the intermediate ones should be closer in 
meaning to the center one than to the extremes. This has the effect 
of spreading the distribution. The same end may be accomplished 
by making the intervals on the scoring stencil smaller in the center 
than at the ends. 

Only universally understood phrases should be used. Slang is 
effective if there is no doubt as to its meaning. 

Such terms as average, very, extremely, excellent, good, fair, or poor, 
should be used sparingly. Use in their place adjectives which in 
themselves express the varying degrees of the trait. In place of 
extremely neat one might say fastidious, or in place of very careless in 
dress one might say slovenly. 

The descriptive phrases should be short and to the point. 

These phrases should be set in small type, and there should be 
plenty of white space between them. 

The favorable extremes of the scales should be alternated so as to 
do away with a motor tendency to check at one margin of the page. 

The construction of stencils for scoring the scales will depend on 
the purposes for which these scores are to be used. If correlations 
with other variables are desired, the same stencil may be used, reading 
from left to right. If a total score is desired on any one individual, 
two or more stencils will be necessary; one reading from left to right, 
one from right to left, and one perhaps reading from the center in 
both directions, in case it has been found that the central phrase 
describes the most favorable degree of the trait for any special voca- 
tional purpose. There are innumerable possibilities in the way of 
handling the data. 
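The three kinds of stencil mentioned above can be expressed as three readings of the same left-to-right space number. The sketch below is illustrative only: the direction labels and the numbering of the centre-reading stencil (highest at the middle, falling toward both ends) are assumptions, since the article does not specify them.

```python
def read_stencil(space, direction, n_spaces=20):
    """Convert a left-to-right space number into a score for the chosen stencil."""
    if direction == "left_to_right":
        return space
    if direction == "right_to_left":
        return n_spaces + 1 - space
    if direction == "from_center":
        # Assumed numbering: highest at the middle, falling off toward both ends.
        return min(space, n_spaces + 1 - space)
    raise ValueError(f"unknown stencil direction: {direction}")


def total_score(spaces, directions, n_spaces=20):
    """Sum one individual's scores, each trait read with its own stencil."""
    return sum(read_stencil(s, d, n_spaces) for s, d in zip(spaces, directions))
```

For correlations with other variables the left-to-right reading alone suffices, since reversing a scale changes only the sign of a coefficient; a total score requires that every trait first be read so that the favorable degree counts as high.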

Rugg voices some cautions which must be observed in using the 
Army type of scale, two of which may apply to any sort of scale: (1) 
That the final rating be an average of three independent ratings, 
(2) made by men who are thoroughly acquainted with the persons to 
be rated. Raters should be selected with the same care as a jury, 
and those who are subject to prejudice or bias, or who have numerous 
complexes, should be eliminated. In judging the character of a person 







it is important to know that others are biased against him, but it is 
more important to know the cause of that bias, and we cannot depend 
upon the biased raters for this information. The true reasons for dis- 
like are often hidden by rationalizations. 


VII


Graphic rating scales may find use in a variety of psychological 
studies. The following are some of the uses to which they may be put. 

As criteria in evaluating tests of personality. 

As criteria in evaluating vocational tests. 

For measuring test responses. 

For rating applicants for positions on traits which are at present 
impossible of objective measurement. 

For rating improvement in an employee. 

For vocational guidance. 

For rating clinical cases. 

For rating children on deportment. 

For measuring the effect of drugs and other variables on efficiency. 

For any psychological experimentation involving verbal reports. 

The use of the scale will necessitate definite concrete thinking on 
the problem, and will aid in analysis and the avoidance of snap 
judgments. 


BIBLIOGRAPHY 


1. Bureau of Personnel Research, Carnegie Institute of Technology. Mis- 
cellaneous graphic rating scales. 

2. Downey, J. E.: The Will-Profile. 2nd Edition, University of Wyoming, 
Department of Psychology, Bulletin No. 3, 1919. 

3. Freyd, Max.: The Personalities of the Socially and the Mechanically 
Inclined. To be published in the Psychological Review Monograph Series. 

4. Galton, F.: “Inquiries into Human Faculty.” 

5. Hayes, M. H. S. and Paterson, D. G.: Experimental Development of the
Graphic Rating Method. Psychological Bulletin, Vol. XVIII, 1921, pp. 98-99. 

6. Hollingworth, H. L.: “Judging Human Character.” N. Y., 1922. 

7. Industrial Relations Association of America. Proceedings of Annual 
Convention, Chicago, 1920. 

8. Kenagy, H. G.: Is the High Cost of Distribution Due to Misfit Salesmen? 
Printers’ Ink, Nov. 2, 1922. 

9. Kornhauser, A. W.: The Psychology of Vocational Selection. Psychological 
Bulletin, Vol. XIX, 1922, pp. 192-229. 

10. Mendenhall, G. N.: “Self-Measurement Scale.” University of Iowa.

11. Miner, J. B.: The Evaluation of a Method for Finely Graduated Estimates
of Abilities. Journal of Applied Psychology, Vol. I, 1917, pp. 123-133.




12. National Association of Corporation Training. Ninth Annual Proceed- 
ings, 1921. 

13. Oschrin, E.: How Rating Scales are Working Out in Practice. Lefax 
Magazine, September, 1922. 

14. Plant, J. S.: Rating Scheme for Conduct. American Journal of Psychiatry,
Vol. I, 1922, pp. 547-572. 

15. Rugg, H. O.: Is the Rating of Human Character Practicable? Journal of 
Educational Psychology, Vol. XII, 1921, pp. 425-438, 485-501; Vol. XIII, 1922, 
pp. 30-42, 81-93. 

16. Scott, W. D.: Do You Want to Know What Others Think about You? 
American Magazine, November, 1920. 

17. Scott, W. D. and Hayes, M. H. S.: “Science and Common Sense in Work-
ing with Men.” New York, 1921.

18. The Scott Co. Laboratory. Bulletins on the Graphic Rating Scale. 

19. Terman, L. M.: “Materials for the Study of Gifted Children.” Stanford
University. 

20. Thorndike, E. L.: A Constant Error in Psychological Rating. Journal of 
Applied Psychology, Vol. IV, 1920, pp. 25-29. 











































THE UNRELIABILITY OF THE DIFFERENCE BE- 
TWEEN INTELLIGENCE AND EDUCATIONAL 
RATINGS 


J. CROSBY CHAPMAN 
Department of Education, Yale University 


It is the irony of fate that those who, five years ago, were attempting
to overcome the scepticism of practical schoolmen with reference to 
mental and educational tests should within so short a period have to 
warn the erstwhile sceptics that they are now putting too much faith 
in the instruments which such a short while back were the objects of 
their distrust. With the entrance of “intelligence tests” and “school
tests,” it was a great temptation to “measure” “intelligence” and
“school achievement,” and then by the difference in standing to
estimate the extent to which an individual was taking advantage of 
his school opportunity. The general idea is so attractive and the 
results, if true, so useful that schoolmen have been captivated by the 
simplicity of a definite figure which promised to give such valuable 
information with regard to the pupil and the school. Provided suffi- 
ciently accurate differential instruments are available, no one doubts 
that the procedure is most useful, but in the absence of such instru-
ments I have been much shocked by the rigid manner in which the
differences in intelligence level and school level, resulting from single
tests of each, have been interpreted. The single measure of intelli- 
gence and the single measure of school achievement have both been 
treated as though they were the ratings of two isolated traits made by 
an hypothetical but infinitely wise judge. This work promises to 
spread so rapidly that it seems advisable to issue certain caveats which 
are the result of an examination of its logical and statistical basis. 

While lip service is given to more exacting definitions of intelli- 
gence, most of the devices in common use make no pretence at measur- 
ing rate of learning in a controlled situation and under specific time 
conditions, but are content to assume that the amount learned, or 
facility acquired from the experiences afforded by a supposedly 
fairly uniform environment gives the most reliable clue to intelli- 
gence. Accepting this idea, that the extent of the benefit derived 
from past experience will be used as the test of intelligence, where can 
we find for any grade a more uniform environment than that pro- 
vided by the school? Therefore it follows that the achievement in 
standard school tests by its very nature must be a fairly satisfactory 


index of intelligence. The psychologist sits in his laboratory and 
devises “Intelligence Tests” and his friend, the educator, dropping in,
remarks on the excellence of a considerable part of his material as a
test of school achievement. The following day the educator sits in his
schoolroom, constructing his “School Achievement Tests” and his
friend the psychologist later comments on the suitability of large
portions of his material as measures of what he considers to be intelli-
gence. In spite of this amusing agreement, the two tests emerge, one
definitely labelled “Intelligence Test” and the other equally definitely
labelled “School Products Test.” Under the assurance afforded by
these definite labels, their partially similar nature and common origin
are soon forgotten. Without any compunction “Intelligence Ratings”
and “School Achievement Ratings” are treated as measures of 
the raw material involved and the quality of the finished product 
respectively! To a certain degree the two tests measure the same 
traits and whatever traits they measure they are themselves unreliable, 
as is shown by the repetition of each using identical forms, yet the 
difference in achievement in the tests is made the basis of a Mental- 
educational-differential Index or of a Mental-educational-achievement 
quotient. The injustice of many decisions made on such assumptions 
has caused me to calculate theoretically for a population made up of 
a single grade the reliability which can be placed in a measure which 
is the difference between achievement in the so-called intelligence test, 
and achievement in the so-called school products test. The remainder 
of the paper confines itself to the derivation of the general formula 
and the application of this formula to several problems. It may be 
said that the results indicate how extremely sceptical we must be of our 
present practices. 

Suppose two standard intelligence tests, such as any two of the fol- 
lowing—National, Otis, Pressey, Terman, etc., etc.—and two standard 
classroom products survey tests, such as the Lippincott-Chapman 
Test or the Illinois Examination or any other similar combinations are 
given to a group. Let the two intelligence tests be labelled $I_1$ and $I_2$
and the two school battery tests $S_3$ and $S_4$, and let the scores of each
individual in these four tests, when expressed as deviations from the
means, be $x_1$, $x_2$, $x_3$, $x_4$. Then the reliability of the measure of the
so-called differential achievement in intelligence and school work is
dependent on the degree of the correlation between the differences
when found with one set of measures, say $I_1$ and $S_3$, and the difference
when found by a second set (supposed equally acceptable), $I_2$ and $S_4$.




Difference between Intelligence and Rating 105 


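Returning to the symbolic notation, the trust that we can put in the
procedure is dependent on the degree of correlation between

$$\frac{x_1}{\sigma_1} - \frac{x_3}{\sigma_3} \quad \text{and} \quad \frac{x_2}{\sigma_2} - \frac{x_4}{\sigma_4}.$$

Let the symbolic representation of the correlation be $r_{d_1 d_2}$,
and let the correlation between
$I_1$ and $I_2$ = $r_{12}$,
$S_3$ and $S_4$ = $r_{34}$, etc.

Then

$$r_{d_1 d_2} = \frac{\sum\left(\frac{x_1}{\sigma_1} - \frac{x_3}{\sigma_3}\right)\left(\frac{x_2}{\sigma_2} - \frac{x_4}{\sigma_4}\right)}{\left[\sum\left(\frac{x_1}{\sigma_1} - \frac{x_3}{\sigma_3}\right)^2\right]^{1/2}\left[\sum\left(\frac{x_2}{\sigma_2} - \frac{x_4}{\sigma_4}\right)^2\right]^{1/2}}$$

$$= \frac{\frac{\sum x_1 x_2}{\sigma_1 \sigma_2} + \frac{\sum x_3 x_4}{\sigma_3 \sigma_4} - \frac{\sum x_2 x_3}{\sigma_2 \sigma_3} - \frac{\sum x_1 x_4}{\sigma_1 \sigma_4}}{\left[\frac{\sum x_1^2}{\sigma_1^2} - \frac{2\sum x_1 x_3}{\sigma_1 \sigma_3} + \frac{\sum x_3^2}{\sigma_3^2}\right]^{1/2}\left[\frac{\sum x_2^2}{\sigma_2^2} - \frac{2\sum x_2 x_4}{\sigma_2 \sigma_4} + \frac{\sum x_4^2}{\sigma_4^2}\right]^{1/2}}$$

but

$$\frac{\sum x_1 x_2}{\sigma_1 \sigma_2} = n r_{12}, \text{ etc., also } \frac{\sum x_1^2}{\sigma_1^2} = n,$$

whence

$$r_{d_1 d_2} = \frac{n r_{12} + n r_{34} - n r_{23} - n r_{14}}{(n - 2 n r_{13} + n)^{1/2}(n - 2 n r_{24} + n)^{1/2}} = \frac{1}{2}\,\frac{r_{12} + r_{34} - r_{23} - r_{14}}{(1 - r_{13})^{1/2}(1 - r_{24})^{1/2}} \qquad (X)$$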
Formula X reduces to simpler terms if we make the assumption, which
is in general accordance with the facts, that

$$r_{13} = r_{14} = r_{23} = r_{24}.$$

In this case the equation becomes, rewriting it and putting

1 = intelligence $I_1$
2 = intelligence $I_2$
3 = school test $S_3$
4 = school test $S_4$

$$r_{d_1 d_2} = \frac{\dfrac{r_{I_1 I_2} + r_{S_3 S_4}}{2} - r_{I_1 S_3}}{1 - r_{I_1 S_3}}$$

or neglecting subscripts

$$r_{d_1 d_2} = \frac{\dfrac{r_{II} + r_{SS}}{2} - r_{IS}}{1 - r_{IS}} \qquad (Y)$$


Applying Formula X to certain problems, we select first some data 


in my possession obtained from a group of 208 unselected Grade VII 
pupils. 


Correlations
National Intelligence and Pressey Intelligence        $r_{12} = .48$
National Intelligence and School Product Test¹        $r_{13} = .61 = r_{14}$
Pressey Intelligence and School Product Test          $r_{23} = .54 = r_{24}$
School Product Test and School Product Test           $r_{34} = .75$ (assumed)

$$r_{d_1 d_2} = \frac{1}{2}\,\frac{.48 + .75 - .54 - .61}{(1 - .61)^{1/2}(1 - .54)^{1/2}} = .094$$

Similar data may be taken from two studies by Gates,² which, although issuing
from selected groups, show the unreliability of the differential
procedure.
Average correlation for each grade of Grades VI, VII, VIII.

Otis Intelligence and National Intelligence                 $r_{12} = .59$
National Intelligence and Thorndike-McCall Reading          $r_{13} = .45 = r_{14}$
Otis Intelligence and Thorndike-McCall Reading              $r_{23} = .57 = r_{24}$
Thorndike-McCall and Thorndike-McCall Reading               $r_{34} = .59$

whence, for these data,

$$r_{d_1 d_2} = .164$$

For another group of tests, taking again average correlations for
several grades

Illinois Intelligence and Otis Intelligence                 $r_{12} = .54$
Illinois Intelligence and Thorndike-McCall Reading          $r_{13} = .53 = r_{14}$
Otis Intelligence and Thorndike-McCall Reading              $r_{23} = .61 = r_{24}$
Thorndike-McCall and Thorndike-McCall Reading               $r_{34} = .57$

whence

$$r_{d_1 d_2} = -.035$$
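The three applications of Formula X can be reproduced numerically. The sketch below simply evaluates the formula with the correlations quoted for each problem; the function name and the dictionary layout are illustrative choices, not part of the article.

```python
import math


def formula_x(r12, r34, r23, r14, r13, r24):
    """Correlation between the two standardized differences (Formula X)."""
    numerator = r12 + r34 - r23 - r14
    denominator = 2 * math.sqrt((1 - r13) * (1 - r24))
    return numerator / denominator


# Correlations as quoted in the text for the three problems.
problems = {
    "I": dict(r12=.48, r34=.75, r23=.54, r14=.61, r13=.61, r24=.54),
    "II": dict(r12=.59, r34=.59, r23=.57, r14=.45, r13=.45, r24=.57),
    "III": dict(r12=.54, r34=.57, r23=.61, r14=.53, r13=.53, r24=.61),
}
results = {name: formula_x(**values) for name, values in problems.items()}
```

The computed values agree with the printed figures of .094, .164 and −.035 to within about .001.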


1 The School Product Test was the Lippincott Classroom Products Summary
Test consisting of four parts.

2 Study of Reading Tests, Journal Educational Psychology, October, 1921.
Correlation of Achievement in School Subjects with the Intelligence Tests and
other Variables. Journal Educational Psychology, April, 1922.







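The low value of $r_{d_1 d_2}$ in both cases proves how unreliable is any verdict
within a grade group which involves intelligence measured by a
single test and reading ability also measured by a single test.

The standard error (root mean square error) made in estimating
$\frac{x_1}{\sigma_1} - \frac{x_3}{\sigma_3}$ from $\frac{x_2}{\sigma_2} - \frac{x_4}{\sigma_4}$ by the relationship

$$\frac{x_1}{\sigma_1} - \frac{x_3}{\sigma_3} = r_{d_1 d_2}\,\frac{\sigma_{d_1}}{\sigma_{d_2}}\left(\frac{x_2}{\sigma_2} - \frac{x_4}{\sigma_4}\right),$$

where $\sigma_{d_1}$ and $\sigma_{d_2}$ are the standard deviations of the two differences, is given below.

$$\sigma_{d_1}\left(1 - r_{d_1 d_2}^2\right)^{1/2}$$

Substituting for the three problems considered

Problem I. Standard error = $\sigma_{d_1}(1 - .009)^{1/2}$

Problem II. Standard error = $\sigma_{d_1}(1 - .028)^{1/2}$

Problem III. Standard error = $\sigma_{d_1}(1 - .001)^{1/2}$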
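How little the prediction gains may be seen by evaluating the factor $(1 - r^2)^{1/2}$ for the three correlations reported above (.094, .164 and −.035). The sketch is a plain arithmetic check; the function name is illustrative.

```python
import math


def error_fraction(r):
    """Standard error of estimate as a fraction of the standard deviation of the difference."""
    return math.sqrt(1 - r * r)


reported = {"I": 0.094, "II": 0.164, "III": -0.035}
fractions = {name: error_fraction(r) for name, r in reported.items()}
```

In each case the factor exceeds .98, so the error of the estimate is practically as large as the spread of the differences themselves. Squaring .164 gives about .027, slightly below the .028 printed for Problem II; the conclusion is unaffected.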


From these three values of the standard error it is apparent that the
difference in standing in a single test of intelligence and a single school 
achievement test gives almost no basis of prediction within a typical 
grade group of what the difference will be when two other similar 
tests designed to measure the same factors are employed. 

Assuming that the intelligence test, even though unreliable, is a 
perfectly valid intelligence measure, and that the school battery test, 
even though unreliable, is again a perfectly valid school measure, 
Formula Y provides us with a means whereby we can calculate the 
accuracy with which the intelligence test and the school test must 
function, if we are to make predictions of the differential educational 


index with a degree of probability which a correlation coefficient
$r_{d_1 d_2} = .75$ allows. This, be it noted, is still an insecure basis of
prediction.

We will assume, as is reasonable, that the true correlation of the
ideal intelligence test and the ideal school test is .7.
Then substituting in Formula Y,

$$\frac{\dfrac{r_{II} + r_{SS}}{2} - .7}{1 - .7} = .75$$

whence $r_{II} + r_{SS} = 1.85$,
or if $r_{II} = r_{SS}$,
then $r_{II} = r_{SS} = .93$.


Within a population of a grade the various intelligence tests among 
themselves and the various school tests among themselves certainly 
cannot be relied upon to correlate better than .7. Using Brown’s 
formula we may therefore calculate, on the above assumption that each
is a perfectly valid measure, how often we must repeat the measure
of intelligence and the measure of school achievement to get the 
necessary correlation of .93. 


Let n = number of times the tests must be repeated, then

$$\frac{n(.7)}{1 + (n - 1)(.7)} = .93$$

whence n = 6 (approx.)
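Both steps of this argument, solving Formula Y for the required reliability and then solving Brown's formula for the number of repetitions, can be verified with a few lines. The function names are illustrative, and the algebraic rearrangements are the only thing added to the figures given in the text.

```python
import math


def required_mean_reliability(r_target, r_is):
    """Solve Formula Y for (r_II + r_SS) / 2, given the desired correlation of differences."""
    return r_target * (1 - r_is) + r_is


def brown_repetitions(r_single, r_needed):
    """Solve Brown's formula  r_needed = n r / (1 + (n - 1) r)  for n."""
    return r_needed * (1 - r_single) / (r_single * (1 - r_needed))


needed = required_mean_reliability(0.75, 0.7)   # mean of r_II and r_SS
n = brown_repetitions(0.7, 0.93)                # repetitions of each test
repetitions = math.ceil(n)                      # whole number of repetitions
```

The required mean reliability comes to .925, which the text rounds to .93, and the number of repetitions comes to a little under six, in agreement with the approximate value of 6.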


Such facts as are presented above must be recognized by those who 
propose to determine the difference within a single grade, of intellectual 
and school achievements when measured by such instruments as are 
at present available. The psychologist must restrain the ardor of the 
commercial houses when making fanciful claims for the efficiency of 
their tests. These claims may be sanctioned in the realms of commer- 
cial advertising but they can never be justified at the more exacting 
bar of statistical truth. 













IS IT NECESSARY TO WEIGHT EXERCISES IN 
STANDARD TESTS? 


HARL R. DOUGLASS 


Professor of Education, University of Oregon 
AND 
PETER L. SPENCER 


University High School, University of Oregon 


The Problem.—The purpose of this article is to raise the question of 
whether it is necessary to weight the separate exercises or questions
that go to make up a standard test and to give some statistical data 
bearing on the question raised. 

Test and scale papers are usually scored in one of three ways. 
When scales are used the most difficult exercise completed correctly 
in the case of difficulty scales, or the sample highest in quality in the 
case of quality scales is taken as the measure of the specimen being 
scored. Among those of this type of measuring instrument are the 
Starch scales for English grammar and for arithmetic, the Thorndike 
Drawing scale, the Ayres and the Thorndike scales for measuring 
handwriting, the Willing, the Harvard-Newton and the Hillegas scales 
for measuring the quality of composition. These two types of measur- 
ing instruments are usually referred to as (1) difficulty or power scales 
and (2) scales for quality. 

In scoring papers in standard tests a cumulative method of scoring 
is used. Tests with respect to method of scoring are of two classes 
according to whether the unit exercises or questions making up the 
test are weighted or not. Some tests employ the simple method of
counting each unit exercise or question as possessing unitary value.
This is true of practically all tests where speed rather than power is 
the thing measured and of all tests where the problems or exercises 
are selected so as to be of equal difficulty. Of this type are the Courtis 
Research Tests, the Courtis Supervisory Tests, the Monroe Standard 
Tests for Algebra, the Monroe Diagnostic Tests for Arithmetic, the 
Handschin Tests for French and for Spanish. 

Many standard tests are made up of exercises, problems or ques- 
tions of unequal difficulty and value in order that adequate opportunity 
for diagnosis and for measurement of various degrees of power might 
be provided, or in some cases, where it was found impractical to obtain 
exercises, problems or questions of equal difficulty and requiring equal 
amounts of time for completion. Of this class of tests are the Monroe 
Standardized Silent Reading Tests, the Stone Reasoning Tests, the 
Kansas Silent Reading Tests, the Henmon Latin Tests, and the 
Douglass Algebra Tests. 

The cumulative method employed in scoring these tests involves 
the weighting of the unit problems, exercises or questions. The 
weights are usually expressed in units of P.E. (M.D.) or S.D. recalculated
with reference to an arbitrarily determined zero point. The methods 
used in the determination of these weights, with slight variations with 
regard to the arbitrary determination of the zero point, are practically 
uniform and established by custom and reason among those devising 
scales and tests and need not be described here. It is sufficient to say 
that the procedure is a long and tedious one, requiring much time and 
involving possibility of errors. 

In the scoring of tests where weights are used, a great deal of time 
and effort is involved in adding the assigned weights. It is much 
simpler merely to count the number of exercises, problems or questions 
completed correctly. The percentage of errors resulting from the 
adding of weights is higher than many are aware, as those who have 
gone over test papers previously scored by another know. Conse- 
quently if it could be shown that little or nothing would be lost by 
using the simpler plan of counting the correct responses, not only 
could the time involved in scoring the test papers be materially 
reduced and the possibility of error decreased, but the intricate 
process of determining the weights could be eliminated. Charters¹
reported that in the case of his Diagnostic Tests for Language and
Grammar, the correlation between the rankings of pupils' scores when
the items of tests were weighted and rankings when no weights were
used was something over 90 per cent. This is very significant and
the question raised is of sufficient importance to warrant further 
investigation. 

It must be admitted that the ideal scale would measure exactly 
any amount of the ability or quality under measurement with reference 
to an absolute zero. It has not yet been demonstrated that absolute 
zero in any school achievement can be determined. In the determina- 
tion of zero points of scales where zero points have been located, 
methods have been used which aim at but arbitrary approximations. 
This being the case, measurement by means of our present supply of 


1 Charters, W. W.: Constructing a Language and Grammar Scale. Journal 
of Educational Research, Vol. 1, April, 1920, p. 255. 













tests and scales gives us but little more than relative standing. For 
the purposes of exact measurement of ability as for some purposes of 
psychology, something is lost in the matter of comparison of absolute 
amounts by using only relative standings. The significance of this 
consideration in educational practice is, however, almost an academic 
consideration and might easily be overestimated. For the usual uses 
of educational achievement tests and scales—for the purpose of 
diagnosis, for use in motivation by means of measurement, and for 
purposes of comparison and survey—relative standing suffices. 

The Data.—One of the authors of this report some time ago com- 
pleted the derivation and standardization of a series of diagnostic 
tests for the fundamental operations of algebra. The scoring of these 
tests involves the use of weights and norms have been determined on 
that basis. These tests were given to a class of 25 Junior V (Grade IX) 
pupils in the University High School of the University of Oregon. 
Two sets of scores were calculated, one with the use of weights, the 
other without. For the purpose of checking, coefficients of correla- 
tion were found according to two methods; the method of paired 
measures, and the short method of grouping and calculating by group 
intervals.²











                                          Pearson method of   Rugg device for
                                          paired measures     grouping

Test I, collection of terms ............  .98  ± .005         .999 ± .006
Test II, multiplication ................  .99  ± .002         .996 ± .001
Test III, subtraction ..................  .995 ± .001         .961 ± .01
Test IV, solution of simple equations ..  .996 ± .001         .991 ± .002





Note—The variations in the coefficients thus found are due 
to the grouping of the measures in class intervals and using the mid- 
point of each interval to represent in the calculations each measure 
in the interval. 
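The comparison described here can be sketched in a few lines. The item weights and pupil responses below are invented purely for illustration and are not the authors' data. The probable-error formula, P.E. = 0.6745(1 − r²)/√N, is the one customary at the time and is assumed rather than stated in the article, though it does reproduce several entries of the tables (for example ±.005 for r = .98 with 25 pupils).

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation of paired measures."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def probable_error(r, n):
    """Probable error of r (assumed formula): 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)


# Hypothetical weights for five exercises and responses (1 = correct) of six pupils.
weights = [1.0, 1.5, 2.0, 2.5, 3.0]
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
]

unweighted = [sum(row) for row in responses]                 # count of correct exercises
weighted = [sum(w * a for w, a in zip(weights, row)) for row in responses]
r = pearson_r(unweighted, weighted)
print(unweighted, weighted, round(r, 3), round(probable_error(r, len(responses)), 3))
```

With real papers the two lists would be the scores obtained from the same pupils with and without the published weights.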

These correlations were surprisingly high and uniform and it was 
decided to proceed with the investigation using other tests. Test 
papers written by classes in the same school using the Henmon Latin 


1 Douglass, Harl R.: A Series of Standardized Diagnostic Tests for the Funda- 
mentals of First Year Algebra. Journal of Educational Research, October, 1921. 
2 Rugg, H. O.: “Statistical Methods Applied to Education.” Houghton
Mifflin Company, 1917, pp. 260-270.































Tests and the Gregory Language Tests, were used as material as were
test papers written by members of a Grade VII class in the Ashland
(Oregon) junior high school where the Monroe Standardized Silent
Reading Tests had been given. Coefficients calculated from these
were as follows:








                                 Number        Pearson method of   Rugg device for
                                               paired measures     grouping

Henmon Latin Test A ..........   [illegible]   .985 ± .006         .996 ± .0011
Gregory Test for Language ....   30            .991 ± .0022        .975 ± .0061
Monroe Standardized Silent
  Reading Test ...............   32

  Monroe test: Rate, .993 ± .0017; Comprehension, .978 ± .0052































RETENTION AFTER LONG PERIODS 


DEAN A. WORCESTER 
Psychological Laboratory, University of Colorado 


In the summer of 1915 the author learned twenty-one 100-word 
selections from the works of Arnold and Huxley. Twelve of these 
were learned by the auditory method and nine by the visual method 
of presentation. The method of learning was as follows: 

On one occasion the subject was handed a slip of printed material 
which was to be read over and over until the subject judged himself 
able to repeat the subject matter exactly. On the alternate days the 
experimenter read a selection aloud until in the judgment of the subject 
the selection could be repeated exactly. The recitations were oral. 

Only one selection was presented at a sitting. The whole method 
was always used. 

The rate of reading was not pre-determined but the subject was 
allowed to receive the presentation at that speed which seemed to 
him desirable. Unless requested by the subject to do otherwise the 
experimenter read aloud at a rate of two words per second, that is, 
a 100-word selection was read in 50 seconds. Frequently, the subject 
would request that the rate of reading be increased as the memorizing 
became more nearly complete. This increased speed seemed to be 
justifiable as it is unquestionably the natural method. 

The subject was asked not to attempt a reproduction of the 
material until practically sure that it could be given exactly. As 
a matter of fact, it was finally decided to accept and record a recitation 
which did not fall below 95 per cent of perfection. Retention was 
tested by the retained members method. Learning was carried only 
to the point of one successful reproduction and retention was tested 
after 1, 2, and 7 days from the time of learning. Times were kept 
to the even second with a stop watch. 

After approximately 5 years, the retention of this material was 
tested by the relearning method with the interesting results shown in 
the following table. 

It should perhaps be observed that nearly all of the relearning was 
done at night and following a day’s work at the University. 



AVERAGE RETENTION OF MATERIAL LEARNED BY AUDITORY AND VISUAL
PRESENTATION AFTER APPROXIMATELY FIVE YEARS

                                     AUDITORY            VISUAL
Average lapse of time .............  4 years, 357 days   4 years, 360 days
Average learning time 1915 ........  13.03 minutes       6.29 minutes
Average learning time 1920 ........  7.24 minutes        3.41 minutes
Average saving of time ............  44.44 per cent      43.09 per cent


Inasmuch as the material had not been reviewed during the interim,
this seems to be a large percentage of saving.
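The saving in the last row of the table is the usual measure of the relearning method. A minimal sketch follows, assuming that the saving is the time saved in relearning expressed as a percentage of the original learning time, and that the tabled times are decimal minutes. Applied to the auditory averages it returns the tabled 44.44 per cent. The tabled average for the visual selections may have been struck selection by selection, so it need not follow from the averaged times in the same way.

```python
def percent_saving(original_minutes, relearning_minutes):
    """Saving by the relearning method, as a percentage of the original time."""
    return 100 * (original_minutes - relearning_minutes) / original_minutes


auditory = percent_saving(13.03, 7.24)  # averages for the auditory selections
print(round(auditory, 2))
```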























DRILL IN ARITHMETIC 
F. B. KNIGHT 


University of Iowa 


Given the following drill material for a lower grade:

 (1)   (2)   (3)   (4)   (5)   (6)   (7)
  6     4     1     8     9     8     9
  7     6     9     7     2     0     0
  4     2     6     1     7     4     4
  3     3     5     4     3     3     7
  2     5     3     5     4     6     8

 (8)   (9)  (10)  (11)  (12)  (13)  (14)
  5     4     9     5     6     7     9
  1     7     2     9     7     5     8
  8     6     1     2     0     4     7
  0     1     6     1     3     0     4
  3     2     6     5     5     3     6

(15)  (16)  (17)  (18)  (19)  (20)  (21)
  8     8     8     9     5     8     4
  5     0     2     0     8     9     8
  3     5     4     6     0     4     1
  9     7     6     5     7     1     0
  ?     2     7     8     1     3     3





What actual drill do these exercises provide? How can we estimate
the strengthening of specific connections through use of the practice
provided?

On first glance it would appear that the drill is evenly distributed,
for:

0 appears  9 times.     5 appears 12 times.
1 appears  8 times.     6 appears 11 times.
2 appears  9 times.     7 appears 12 times.
3 appears 11 times.     8 appears 12 times.
4 appears 12 times.     9 appears  9 times.


What is the practice afforded the specific connections in these exercises?
Neglecting the difference between a “seen to seen” and an “unseen to
seen” number and further neglecting the appearance of numbers
(unseen) in the tens column,¹ we get the following spread, when adding
the columns upward:

TABLE I.—SPECIFIC PRACTICE WHEN ADDING UPWARD

To be read, the number in the column is added to the number in the row. Thus 3
is added to 4 twice, to 7 no times, to 2 once.
.| 2. ees ee 4 6 | 
ae line eee: Cx sem 1 | 
1 es 1 ! 
2 2 1 1 | 
3 4 3 1 2 2 
4 1 1 1 1 
5 1 2 3 1 | 
6 1 2 3 2 | 
Th 
8 1 1 1 | 
eh ak Gee ee ee | 





¹ Seen to seen is when, in the exercise 6, 7, 4, 3, 2 (printed as a column),
the 2 is added to the 3; unseen to seen is when the resulting 5, held in mind, is
added to 4; neglecting numbers in the tens column would be when, now adding
down, the unseen 13 (6 + 7) is added to 4, consider now the 3 + 4 only.













Here the practice allows for 84 exercises of specific connections.
Fifty different connections receive exercise; 60 per cent of the combinations
that could be exercised are exercised. Twenty-five of the 84
chances are limited to 8 combinations, i.e., 30 per cent of the exercise
goes to 10 per cent of the possible connections to be exercised.
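The tally behind these figures can be sketched as follows. The function takes one printed column and lists the connections practiced in adding it, keeping only the units digit of the running sum, as the text directs. The names `connections`, `held` and `seen` are illustrative and not the author's; exercise (1) of the drill sheet serves as the example.

```python
from collections import Counter


def connections(column, upward=True):
    """List the (held, seen) pairs practiced in adding one column.

    `column` gives the printed digits from top to bottom. `held` is the
    units digit kept in mind and `seen` is the next printed digit; the
    tens digit is neglected, as in the text.
    """
    digits = list(reversed(column)) if upward else list(column)
    held = digits[0]
    pairs = []
    for seen in digits[1:]:
        pairs.append((held, seen))
        held = (held + seen) % 10
    return pairs


exercise_1 = [6, 7, 4, 3, 2]            # exercise (1) of the drill sheet
up = connections(exercise_1, upward=True)
down = connections(exercise_1, upward=False)

tally = Counter(up + down)              # how often each connection is practiced
coverage = len(tally) / len(up + down)  # distinct connections per chance
print(up, down, coverage)
```

A column of five digits yields four connections in each direction, which is how 21 columns give the 84 chances counted above.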


TABLE II.—SHOWING THE DISTRIBUTION OF SPECIFIC PRACTICE WHEN COLUMNS
ARE ADDED DOWNWARD

To be read as Table I.
[Body of Table II not legible in the scanned source.]
















































































Here 61 of the possible 84 combinations receive exercise. Seventy- 
two per cent of the combinations which could be exercised are exercised. 
Now if the pupils add both upward and downward for checking
purposes the exercise on specific combinations is as reported in Table
III.

TABLE III.—AMOUNT OF EXERCISE ON SPECIFIC CONNECTIONS WHEN COLUMNS
ARE ADDED BOTH WAYS

































































0 1 2 3 4 5 6 7 8 9 z= 
0 ‘a 1 3 oe 1 1 2 3 2 es 13 
1 - 3 a 1 1 - 1 1 bi 1 8 
2 3 4 2 1 1 3 14 : 
3 7 3 1 3 3 1 2 3 2 25 d 
4 3S ae 2 2 Ss ae 2 1 4 1 16 
5 3 2 3 2 2 1 1 3 17 
6 1 2 1 3 4 2 2 2 1 2 20 
7 1 1 1 1 5 1 2 3 1 16 
8 3 3 1 1 3 3 3 3 20 
9 5 1 1 a 2 2 1 Zz 3 3 20 
p> 18; 18; 12; 17; 20; 17); 17] 21) 16) 14 











Here there were 168 chances for specific connections. One hundred
connections exhaust the table, but even then only 84 per cent of the
possible connections receive exercise. In considering the frequency
with which numbers are added to, i.e., some number is added to 0, 1,
2, 3, 4, etc., the sums of the columns of Table III show fairly even
distribution. When we consider the frequency with which 0, 1, 2, 3,
etc., are added to some number, the sums of the rows of Table III,
we get uneven distribution of practice, the extremes being 1 added to
some number 8 times, 3 added to some number 25 times.
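The two sets of sums discussed here can be obtained separately, as in the sketch below. It counts, over both directions of adding, how often each digit is the number added (the row sums) and how often it is the number added to (the column sums). Only exercises (1) and (2) of the drill sheet are used, so the counts are illustrative and are not those of Table III; the function names are assumptions of this sketch.

```python
from collections import Counter


def connections(column, upward=True):
    """(held, seen) pairs for one column; only the units digit is held."""
    digits = list(reversed(column)) if upward else list(column)
    held = digits[0]
    pairs = []
    for seen in digits[1:]:
        pairs.append((held, seen))
        held = (held + seen) % 10
    return pairs


def marginals(columns):
    """Count, over both directions, how often each digit is added and is added to."""
    added, added_to = Counter(), Counter()
    for column in columns:
        for upward in (True, False):
            for held, seen in connections(column, upward):
                added[held] += 1      # digit held in mind and added
                added_to[seen] += 1   # printed digit it is added to
    return added, added_to


sheet = [[6, 7, 4, 3, 2], [4, 6, 2, 3, 5]]  # exercises (1) and (2) only
added, added_to = marginals(sheet)
print(sorted(added.items()))
print(sorted(added_to.items()))
```

Even in so small a sample a digit such as 0, which is seldom printed, is often the number held in mind, which is why the two distributions need not agree.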

Roughly correlating the frequency with which numbers appear in
the columns to be added, and the connections which are actually practiced,
we get a correlation of +0.35 between frequency of appearance
of numbers in the columns and frequency with which the numbers are
added to some other number in the actual practice when columns
are added upward. When columns are added down the correlation is
+0.46. The correlation between the frequency with which numbers
appear as the first number of the addition x plus y, and the frequency
of appearance as the second number, is -0.53.

The columns given at the beginning of this article were constructed
offhand; the only thing in mind was to have the several numbers
appear on the drill sheet approximately equally often. The fact that
the frequency of the specific drill really provided in adding bears but
little more than a chance relation to frequency of the several numbers
as printed, may be due to unconscious preferences for certain number
combinations by the writer.

An actual drill exercise from the Grade V section of one of the best
textbooks on arithmetic was similarly treated. The frequency with
which numbers appear in the 24 columns is


NUMBER   FREQUENCY
   0          0
   1          0
   2          6
   3         15
   4         16
   5         23
   6         19
   7         14
   8         15
   9         12

The frequency with which some number is added to 0, 1, 2, 3, 4,
etc., is:

     ADDING UP               ADDING DOWN
NUMBER   FREQUENCY      NUMBER   FREQUENCY
   0          0            0          0
   1          0            1          0
   2          6            2          4
   3         14            3         13
   4         16            4         10
   5         19            5         19
   6         14            6         15
   7         10            7         12
   8         10            8         12
   9          7            9         11











The frequency with which 0, 1, 2, 3, 4, etc., is added to some
number is:

     ADDING UP               ADDING DOWN
NUMBER   FREQUENCY      NUMBER   FREQUENCY
   0          8            0          8
   1          6            1          6
   2         11            2         12
   3          8            3          6
   4          8            4         10
   5         10            5         10
   6         11            6          9
   7         10            7          8
   8         11            8         14
   9         12            9         13


In these exercises there are 96 chances to exercise specific connec- 
tions. There are, of course, 100 combinations. 59 combinations 
are exercised or 61 per cent of the possible distribution of practice 
occurring in the adding up. In adding down 51 combinations are 
exercised or 52 per cent of the possible distribution of practice. 

In this drill material how good an index of distribution is the 
frequency with which the several numbers are printed? 

The correlations between frequency of appearance on the page and 
frequency with which the several numbers are added to some number 
are: 

Frequency in print and frequency in adding up +0.19.

Frequency in print and frequency in adding down +0.00.

Frequency in print and frequency in adding both up and down
+0.25.

The contention here is not that every drill exercise should spread
its exercise of specific connections impartially. In some instances
it should. The contention here has to do with the construction of drill
material and with methods of estimating the true nature of a given piece of it.

The above analysis leads to the following considerations: 

1. In constructing drill material the frequency with which different 
numbers appear in the columns for drill in addition is a poor index of 
the distribution of practice actually provided. 

2. Drill material for addition of type x plus y should contain
proper distribution of the x and of the y. Properly distributing the
x does not distribute the y.

3. Construction of drill material, such as the columns used in this
article, is a trial and error affair. The frequency and order of the
numbers should be worked over until upon analysis we get, not numbers
appearing in print with desired frequency, but considering the
unseen numbers also, numbers appearing in both x plus y positions as
the following suggests:

        0      1      2
 0      t      t
 1      t      t
 2      t      t     etc.
 3      t      t
 4      t      t
 5      t      t
 6      t      t
 7      t      t
 8      t      t
 9      t      t
 Σ     10t    10t

Here t represents the number of times a seen number is added to
seen, and also unseen to seen, and where t represents the number of
times the specific connection should be exercised because of its relative
difficulty and amount of previous practice upon it. Dr. Thorndike
has pointed out that now the size of t in most arithmetic material
just happens.

In making the size of t correct, analysis of the drill material after
construction and then trial and error rearrangement is a step toward
the best observation of the law of exercise in arithmetic.














































A COMPARISON OF THE ESSAY AND THE OBJEC- 
TIVE TYPE OF EXAMINATIONS 


DONALD A. LAIRD 
University of Wyoming 


Much interest has been manifest of late months in the true-false, 
yes and no, blank filling types of examination as a means for class 
grading. The chief argument used is that they are time and labor 
saving, objective, correlate highly with the usual essay type of examina- 
tion and with intelligence scores. 

Whether or not they are measuring the same things is quite another 
question but it is one worth considering. This communication 
will report a test which was made to compare these two examination 
methods to find out just how closely they were measuring the same 
thing or things. 

A group of 54 students in elementary psychology were unexpectedly
told to write an essay on “The Adrenals in Psychology.” They were
given unlimited time and told to write everything they knew about them.
A week previously they had completed the study of these glands in
their course. After the essays were completed the students were given
a list of 28 questions covering the same ground completely as it had been
presented to them. These questions were of the objective type so that
they could be answered by a single word or phrase. Still, they were
not of the “yes-no” type, in which some of the answers could be guesses.

We thus have two examinations on the same subject for comparison. 
The essay examination was checked over point by point to see how 
many of the 28 topics had been included in the essay on the adrenals. 
Then the two methods were compared by contrasting the number of 
points scored in the essay with the number of points scored on the same 
scale but in a radically different type of examination. 

The results are startling. By grading the essays on the basis of 
distribution of merit a high correlation might have been obtained 
between these two examinations. But when they are compared in 
the amount of information contained the correlation becomes low and 
the difference a very great one. 

The correlation (r) between the two forms of examination is +0.038
with a PE of ±0.087. In the essay examination the students knew
approximately half as much as on the other, when checked off against 
the points that they had been given in the course of the subject 
under examination. A correlation on the basis of grades given might 


have been high. But should the sparse sampling of the essay test be 
considered a true measure of the student’s knowledge when other 
tests show that he knows really twice as much about the subject as 
he has written? 

In some quarters the essay type of examination is in favor since 
it is assumed to test and develop the precious mental ability of organi- 
zation of materials. If such is true one would expect the more intelli- 
gent students to do better on the essay type than did the poor students. 
With them the correlation between the points scored on the two 
examinations should be higher than in the case of the students equipped 
with a lesser amount or an inferior kind of intelligence. 

To test this hypothesis the students were divided into two groups, 
those above and those below the average intelligence of the class. 
With the group of 27 students who were above the class average in 
score on the Thorndike Intelligence Test for High School Graduates 
the correlation (r) between the two forms of examination was +0.107
with a PE of ±0.121. What little a correlation of this magnitude
may mean might be construed to lend validity to the assumption that in
the case of the more intelligent students, the essay type of examination
is fairer in showing how much they really know about the subject than
is the case with their lower scoring brethren.


SUMMARY 


A comparison was made of the showings of a group of students on 
two types of examination covering the same narrow subject. This 
comparison was made, not on the basis of literary merit or the distri- 
bution of the marks, but rather on the basis of the material presented 
on the topic under examination that the examinations gave evidence 
of the student having mastered. 

When thus compared it was found that: 

1. The average student knows twice as much of the subject when 
tested by an objective, information test as when tested by an essay 
type of examination. 

2. The correlation between the points that are apparently known
by the students on these two types of test is practically zero.

3. In the case of the students above the average intelligence a
higher, but small, correlation is obtained which may indicate that the
examination test becomes more of an intelligence test than an evaluation
of the materials gleaned from the content of the course.


































NOTES ON ARTICLES IN EDUCATIONAL
PSYCHOLOGY IN CURRENT ISSUES OF
OTHER MAGAZINES

REPORTED BY CECILE COLLOTON


Department of Educational Psychology, The Lincoln School of Teachers College 


INTELLIGENCE TESTS 


Native and Acquired Mental Ability as Measured by the Terman Group Test of 
Mental Ability. Dudley W. Willard. School and Society, 1922, Dec. 30, 750- 
756. Results of retesting 216 high school pupils after an interval of 74 months 
with the Terman Group Test. Pearson r = 0.875. Half of growth measured
is due to school training; half, to development of native capacity. 

Classification of Kindergarten Children for First Grade by Means of the Binet 
Scale. Charles B. Dawson. Journal of Educational Research, 1922, December, 
412-422. The advantages of classifying first grade children on the basis of mental 
age. Data on 2029 Detroit kindergarten children.

Professor Terman’s Determinism; a Rejoinder. William C. Bagley. Journal 
of Educational Research, 1922, December, 371-385. Another statement of Dr. 
Bagley’s ideas on general intelligence and mental testing. 

Objective Measures of Intelligence in Relation to High School and College Admin- 
istration. Alexander C. Roberts. Educational Administration and Supervision, 
1922, December, 530-540. Brief summary of the development of intelligence 
tests. Their value and limitations in helping to solve administrative problems. 

The Mental Age of Adults: An Editorial. Frank N. Freeman. Journal of 
Educational Research, 1922, December, 441-444. A review of the arguments 
in the Terman-Lippmann controversy and a defense of the Stanford-Binet Revision. 

A Comparison of the Latin and Non-Latin Group in High School. Edith I. 
Newcomb. Teachers College Record, 1922, November, 412-423. A summary 
of the results obtained in the initial tests given, September, 1921, in over 100 high 
schools by the Classical League of America. 

The Intelligence of a Highly Selected Group. John E. Anderson. School and 
Society, 1922, Dec. 23, 723-725. Scores on Army Alpha rank the Hotchkiss
student group consistently higher than corresponding classes in high schools and
colleges. Discussion of the factors in the selection of the group. 

The Correlation between Intelligence and Accomplishment Quotients. I. N. 
Madsen. School and Society, 1922, Dec. 16, 696-697. Explains why the cor- 
relation between IQ and AQ does not express a true relationship. 


EDUCATIONAL TESTS 


English Composition Scales in Use. Thomas Briggs. Teachers College 
Record, 1922, November, 423-452. Discusses composition scales in general. 
Reports an investigation on standards for promotion in written composition and 
agreement between teacher’s standards. A number of compositions ranging in 
scale values from 4.0 to 8.9 are appended. 




Comparative Validity of the Hotz Scales and the Rugg-Clark Tests in Algebra.
Eleanora Harris and Frederick S. Breed. Journal of Educational Research, 
1922, December, 393-411. A critical study and comparison of the Hotz Scales 
and the Rugg-Clark Tests in Algebra from the point of view of validity, economy 
of time, difficulties in giving, taking, and scoring, and diagnostic value. Nine 
tables and seven figures present the data. 

Prognosis Tests of Ability to Learn Foreign Languages. Thomas H. Briggs. 
Journal of Educational Research, 1922, December, 386-392. Describes a battery 
of tests designed to test special ability to learn foreign languages. Detailed data 
of results and suggestions for improving the tests are given. 

A New Analytic Sewing Scale. K. Murdoch. Teachers College Record, 
1922, November, 453-458. Describes the construction of a new sewing scale and 
its advantages. 

Investigations Concerning the Murdoch Sewing Scale. Clara M. Brown. Teach- 
ers College Record, 1922, November, 459-470. Gives evidence for the validity of 
the Murdoch scale and establishes norms. 


MISCELLANEOUS 


A Study of High School Failures and Their Causes. Harvey A. Smith. Edu- 
cational Administration and Supervision, 1922, December, 557-572. A study 
based on the official school records of over 300 high school pupils. Causes of fail- 
ure are classified and discussed. 

Size of Type as Related to Readability in the First Four Grades. J. Herbert 
Blackhurst. School and Society, 1922, Dec. 16, 697-700. Eighteen-point type 
most readable in Grades 3 and 4; 24-point type in Grades 1 and 2. 

A Scale for Measuring Habits and Practices in Health and Accident Prevention. 
E. George Payne. School and Society, 1923, Jan. 6, 25-27. Discussion of the 
construction of the scale and its value in use.

Determining Chronological Age in Decimal Parts of a Year. Herbert A. Toops. 
Journal of Educational Research, 1922, December, 438-440. A table for deter- 
mining the exact age of a subject upon a given test date. 





































NEW PUBLICATIONS IN EDUCATIONAL
PSYCHOLOGY AND RELATED FIELDS OF
EDUCATION











1. Psychology Applied to Economics.¹—The chief object of this
book (as stated on page 205) “is to make contemporary psychological
views more conveniently available to economic students, so that such
students may work out any consequences which seem to them important.”
Naturally, interest attaches primarily to the psychological
views or questions which have most significance for economics, namely
the “wants” or motives. But when the author began “to investigate
the specifically economic motives, he found so little agreement on the
fundamentals of social psychology involved that a reexamination of
these fundamentals appeared to be necessary.” This reexamination
occupies about 200 of the 300 pages of the book. The psychologist
will no doubt have some difficulty in seeing why the student of economics
can get the contemporary psychological views only by being carried
back to Aristotle and led up to the contemporary by way of
Hobbes, Adam Smith, Jeremy Bentham, the Mills, Bain, etc. The lack
of agreement upon fundamentals of social psychology is due not so
much to errors in interpreting the psychology of the past, as it is to a
dearth of definitely established facts. It is just such a body of fact
that the present day psychologists are endeavoring to accumulate.
The author of this book presents contemporary psychology in its
relation to economics so admirably that it seems unfortunate that more
of the book could not be used for that purpose.

Stimulus and response are accepted as the basic units of human
behavior, and the more cumbersome concepts such as “creative
instinct” are analyzed into simpler though less magic terms. The list of
instincts given by Woodworth in his “Psychology” is adopted, as well
as the same author’s concepts of special aptitudes and acquired drives.

Freudian psychology is examined at some length and some useful 
material is salvaged from it, especially as to the potency of hidden 
motives. The concept of the conditioned reflex, recognized as a 
restatement of the old law of association, becomes a very useful means 
of elaborating the simple motives into new and more complex wants. 
¹ Dickinson, Z. C., “Economic Motives.” Cambridge University Press,


Trial and error are seen to play their part in the establishment of social 
habits and customs. 

Native differences in intelligence and other capacities are granted 
and the necessity recognized of taking these differences into account 
more than is customary among economists in the treatment of thrift, 
providence and similar problems. 

The latter part of the book offers a brief psychological interpreta- 
tion of a number of economic terms such as utility, cost, interest, value, 
work, etc. The analysis of work is especially interesting. There 
need be no single work motive, but rather there are numerous motives, 
simple instinctive tendencies, aptitudes and acquired habits, varying 
from one individual to another and differently played upon from one 
set of work conditions to another. The manipulation of these elements 
of the complex work motive for better work and greater satisfaction in 
work is suggested as a possibility. 

This book should be read by students of economics not only because 
of its warnings against “undue expectations of psychological touch- 
stones,” but because it presents the facts of modern psychology in a 
familiar atmosphere of economic terms and illustrations, and effectively 
points out their application to economics. This economic setting 
may offer some difficulty to the psychological reader as a knowledge of 
the technical meaning of economic terms cannot be presupposed. 

A. T. POFFENBERGER,
Columbia University. 


2. A Composite Photograph of Modern Psychological Tendencies.¹
This phrase is taken from the author’s preface. The fitness of this 
characterization may be judged by the abundant quotations from the 
writings of Holt, Crile, Titchener, Pillsbury, McDougall, Stiles and
James. 

The first 27 chapters may be grouped under the following heads: 
The instincts, the senses, the emotions, mental processes, fatigue, and 
rest. In the 9 remaining chapters the author discusses such widely 
divergent topics as dreams, suggestion, hypnotism, the mind of the 
crowd, salesmanship, advertising, wit and humor, industrial psychol- 
ogy and mental hygiene. 

An unusual feature of the book is the abundance of half-tones 
illustrating laboratory tests and psychological apparatus in use. 

L. Z. 


¹ Givler, Robert Chenault: “Psychology, The Science of Human Behavior.”
New York, Harper and Brothers, 1922, pp. 382. 











